Abstract

Bayesian statistics is based on the subjective definition of proba- 
bility as "degree of belief" and on Bayes' theorem, the basic tool for 
assigning probabilities to hypotheses combining a priori judgements 
and experimental information. This was the original point of view 
of Bayes, Bernoulli, Gauss, Laplace, etc. and contrasts with later 
"conventional" (pseudo-)definitions of probabilities, which implicitly 
presuppose the concept of probability. These notes show that the 
Bayesian approach is the natural one for data analysis in the most 
general sense, and for assigning uncertainties to the results of physical 
measurements - while at the same time resolving philosophical aspects
of the problems. The approach, although little known and usually mis-
understood among the High Energy Physics community, has become 
the standard way of reasoning in several fields of research and has 
recently been adopted by the international metrology organizations in 
their recommendations for assessing measurement uncertainty. 

These notes describe a general model for treating uncertainties 
originating from random and systematic errors in a consistent way and 
include examples of applications of the model in High Energy Physics, 
e.g. "confidence intervals" in different contexts, upper/lower limits,
treatment of "systematic errors", hypothesis tests and unfolding.

"The only relevant thing is uncertainty - the extent of our
knowledge and ignorance. The actual fact of whether or not 
the events considered are in some sense determined, or 
known by other people, and so on, is of no consequence". 

(Bruno de Finetti) 



1 Introduction 

The purpose of a measurement is to determine the value of a physical quan- 
tity. One often speaks of the true value, an idealized concept achieved by 
an infinitely precise and accurate measurement, i.e. immune from errors. In 
practice the result of a measurement is expressed in terms of the best esti- 
mate of the true value and of a related uncertainty. Traditionally the various 
contributions to the overall uncertainty are classified in terms of "statistical"
and "systematic" uncertainties, expressions which reflect the sources of
the experimental errors (the quote marks indicate that a different way of 
classifying uncertainties will be adopted in this paper). 

"Statistical" uncertainties arise from variations in the results of repeated 
observations under (apparently) identical conditions. They vanish if the number
of observations becomes very large ("the uncertainty is dominated by
systematics" is the typical expression used in this case) and can be treated -
in most cases, but with some exceptions of great relevance in High Energy
Physics - using conventional statistics based on the frequency-based definition
of probability.

On the other hand, it is not possible to treat "systematic" uncertainties 
coherently in the frequentistic framework. Several ad hoc prescriptions for 
how to combine "statistical" and "systematic" uncertainties can be found in 
text books and in the literature: "add them linearly"; "add them linearly if 
. . . , else add them quadratically"; "don't add them at all", and so on (see, 
e.g., part 3 of [1]). The "fashion" at the moment is to add them quadrat- 
ically if they are considered independent, or to build a covariance matrix 
of "statistical" and "systematic" uncertainties to treat general cases. These 
procedures are not justified by conventional statistical theory, but they are 
accepted because of the pragmatic good sense of physicists. For example,
an experimentalist may be reluctant to add twenty or more contributions
linearly to evaluate the uncertainty of a complicated measurement, or may
decide to treat the correlated "systematic" uncertainties "statistically", in
both cases unaware of, or simply not caring about, violating frequentistic
principles.

The only way to deal with these and related problems in a consistent way
is to abandon the frequentistic interpretation of probability introduced at
the beginning of this century, and to recover the intuitive concept of probability
as degree of belief. Stated differently, one needs to associate the idea of
probability to the lack of knowledge, rather than to the outcome of repeated 
experiments. This has been recognized also by the International Organi- 
zation for Standardization (ISO) which assumes the subjective definition of 
probability in its "Guide to the expression of uncertainty in measurement"[2]. 

These notes are organized as follows:

• sections 1-5 give a general introduction to subjective probability; 

• sections 6-7 summarize some concepts and formulae concerning random 
variables, needed for many applications; 

• section 8 introduces the problem of measurement uncertainty and deals
with the terminology;

• sections 9-10 present the analysis model; 

• sections 11-13 show several physical applications of the model; 

• section 14 deals with the approximate methods needed when the gen- 
eral solution becomes complicated; in this context the ISO recommen- 
dations will be presented and discussed; 

• section 15 deals with uncertainty propagation. It is particularly short
because, in this scheme, there is no difference between the treatment
of "systematic" uncertainties and indirect measurements; the section
simply refers to the results of sections 11-14;

• section 16 is dedicated to a detailed discussion about the covariance 
matrix of correlated data and the trouble it may cause; 

• section 17 was added as an example of a more complicated inference
(multidimensional unfolding) than those treated in sections 11-15.

2 Probability



2.1 What is probability? 

The standard answers to this question are 

1. "the ratio of the number of favorable cases to the number of all cases"; 

2. "the ratio of the times the event occurs in a test series to the total
number of trials in the series".

It is very easy to show that neither of these statements can define the concept 
of probability: 

• Definition (1) lacks the clause "if all the cases are equally probable".
This has been done here intentionally, because people often forget it.
The fact that the definition of probability makes use of the term "probability"
is clearly embarrassing. Often in text books the clause is replaced
by "if all the cases are equally possible", ignoring that in this
context "possible" is just a synonym of "probable". There is no way
out. This statement does not define probability but gives, at most, a
useful rule for evaluating it - assuming we know what probability is, i.e.
of what we are talking about. The fact that this definition is labelled
"classical" or "Laplace" simply shows that some authors are not aware
of what the "classicals" (Bayes, Gauss, Laplace, Bernoulli, etc.) thought
about this matter. We shall call this "definition" combinatorial.

• Definition (2) is also incomplete, since it lacks the condition that the
number of trials must be very large ("it goes to infinity"). But this is a
minor point. The crucial point is that the statement merely defines the
relative frequency with which an event (a "phenomenon") occurred in
the past. To use frequency as a measurement of probability we have to
assume that the phenomenon occurred in the past, and will occur in the 
future, with the same probability. But who can tell if this hypothesis 
is correct? Nobody: we have to guess in every single case. Notice that, 
while in the first "definition" the assumption of equal probability was 
explicitly stated, the analogous clause is often missing from the second 
one. We shall call this "definition" frequentistic. 

We have to conclude that if we want to make use of these statements to 
assign a numerical value to probability, in those cases in which we judge that 
the clauses are satisfied, we need a better definition of probability. 

Figure 1: Certain and uncertain events.



2.2 Subjective definition of probability 

So, "what is probability?" Consulting a good dictionary helps. Webster's 
states, for example, that "probability is the quality, state, or degree of being
probable", and then that probable means "supported by evidence strong
enough to make it likely though not certain to be true". The concept of 
probable arises in reasoning when the concept of certain is not applicable. 
When it is impossible to state firmly if an event (we use this word as a syn- 
onym for any possible statement, or proposition, relative to past, present or 
future) is true or false, we just say that this is possible, probable. Different 
events may have different levels of probability, depending on whether we think
that they are more likely to be true or false (see Fig. 1). The concept of
probability is then simply 

a measure of the degree of belief that an event will¹ occur.

This is the kind of definition that one finds in Bayesian books [3, 4, 5, 6, 7]
and the formulation cited here is that given in the ISO "Guide to Expression
of Uncertainty in Measurement" [2], which we will discuss later.

¹The use of the future tense does not imply that this definition can only be applied to
future events. "Will occur" simply means that the statement "will be proven to be true",
even if it refers to the past. Think for example of "the probability that it was raining in
Rome on the day of the battle of Waterloo".

At first sight this definition does not seem to be superior to the combinatorial
or the frequentistic ones. At least they give some practical rules to
calculate "something". Defining probability as "degree of belief" seems too
vague to be of any utility. We need, then, some explanation of its meaning;
a tool to evaluate it - and we will look at this tool (Bayes' theorem) later.
We will end this section with some explanatory remarks on the definition,
but first let us discuss the advantages of this definition:

• it is natural, very general and it can be applied to any thinkable event, 
independently of the feasibility of making an inventory of all (equally) 
possible and favorable cases, or of repeating the experiment under con- 
ditions of equal probability; 

• it avoids the linguistic schizophrenia of having to distinguish "scientific" 
probability from "non scientific" probability used in everyday reasoning 
(though a meteorologist might feel offended to hear that evaluating the
probability of rain tomorrow is "not scientific");

• as far as measurements are concerned, it allows us to talk about the 
probability of the true value of a physical quantity. In the frequentistic 
frame it is only possible to talk about the probability of the outcome of 
an experiment, as the true value is considered to be a constant. This approach
is so unnatural that most physicists speak of "95 % probability
that the mass of the Top quark is between ...", although they believe
that the correct definition of probability is the limit of the frequency;

• it is possible to make a very general theory of uncertainty which can 
take into account any source of statistical and systematic error, inde- 
pendently of their distribution. 

To get a better understanding of the subjective definition of probability
let us take a look at odds in betting. The higher the degree of belief that
an event will occur, the higher the amount of money A that someone ("a
rational bettor") is ready to pay in order to receive a sum of money B if the
event occurs. Clearly the bet must be acceptable ("coherent" is the correct
adjective), i.e. the amount of money A must be smaller than or equal to B and
not negative (who would accept such a bet?). The cases of A = 0 and A = B
mean that the events are considered to be false or true, respectively, and
obviously it is not worth betting on certainties. They are just limit cases,
and in fact they can be treated with standard logic. It seems reasonable²
that the amount of money A that one is willing to pay grows linearly with
the degree of belief. It follows that if someone thinks that the probability
of the event E is p, then he will bet A = pB to get B if the event occurs,
and to lose pB if it does not. It is easy to demonstrate that the condition of
"coherence" implies that $0 \le p \le 1$.

²This is not always true in real life. There are also other practical problems related
to betting which have been treated in the literature. Other variations of the definition
have also been proposed, like the one based on the penalization rule. A discussion of the
problem goes beyond the purpose of these notes.

What has gambling to do with physics? The definition of probability 
through betting odds has to be considered operational, although there is 
no need to make a bet (with whom?) each time one presents a result. It 
has the important role of forcing one to make an honest assessment of the 
value of probability that one believes. One could replace money with other 
forms of gratification or penalization, like the increase or the loss of scientific 
reputation. Moreover, the fact that this operational procedure is not to be 
taken literally should not be surprising. Many physical quantities are defined 
in a similar way. Think, for example, of the text book definition of the electric 
field, and try to use it to measure E in the proximity of an electron. A nice 
example comes from the definition of a poisonous chemical compound: it 
would be lethal if ingested. Clearly it is preferable to keep this operational 
definition at a hypothetical level, even though it is the best definition of the 
concept. 
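
The coherence condition on betting odds can be illustrated with a few lines of code. The following is only a sketch and is not part of the original notes: the function names `net_gain` and `is_coherent`, and the stake B = 100, are assumptions made for illustration. A bet is taken here to be incoherent when one of the two parties is certain to lose whatever happens.

```python
def net_gain(p, B, event_occurs):
    """Net gain of a bettor who pays A = p*B in order to receive B if the event occurs."""
    A = p * B
    return B - A if event_occurs else -A


def is_coherent(p, B=100.0):
    """Return True if neither party is certain to lose money, whatever the outcome."""
    gains = [net_gain(p, B, True), net_gain(p, B, False)]
    bettor_always_loses = all(g < 0 for g in gains)
    bettor_always_wins = all(g > 0 for g in gains)  # i.e. the other party always loses
    return not (bettor_always_loses or bettor_always_wins)


# Hypothetical values of p, two of them outside the interval [0, 1].
for p in (-0.1, 0.0, 0.3, 1.0, 1.2):
    print(f"p = {p:5.2f}  coherent: {is_coherent(p)}")
```

Under these assumptions only the values of p between 0 and 1 (limits included) pass the check: for p > 1 the bettor loses in both cases, and for p < 0 the party accepting the bet does.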

2.3 Rules of probability 

The subjective definition of probability, together with the condition of coherence,
requires that $0 \le p \le 1$. This is one of the rules which probability
has to obey. It is possible, in fact, to demonstrate that coherence yields
the standard rules of probability, generally known as axioms. At this point
it is worth clarifying the relationship between the axiomatic approach and
the others:

• combinatorial and frequentistic "definitions" give useful rules for evaluating
probability, although they do not, as is often claimed, define
the concept;

• in the axiomatic approach one refrains from defining what the probability
is and how to evaluate it: probability is just any real number which
satisfies the axioms. It is easy to demonstrate that the probabilities
evaluated using the combinatorial and the frequentistic prescriptions
do in fact satisfy the axioms;

• the subjective approach to probability, together with the coherence 
requirement, defines what probability is and provides the rules which 
its evaluation must obey; these rules turn out to be the same as the 
axioms. 

Figure 2: Venn diagrams and set properties.

Table 1: Events versus sets.

Since everybody is familiar with the axioms and with the analogy events ↔ sets
(see Tab. 1 and Fig. 2) let us remind ourselves of the rules of probability
in this form:

Axiom 1: $0 \le P(E) \le 1$;

Axiom 2: $P(\Omega) = 1$ (a certain event has probability 1);

Axiom 3: $P(E_1 \cup E_2) = P(E_1) + P(E_2)$, if $E_1 \cap E_2 = \emptyset$.

From the basic rules the following properties can be derived:

1: $P(E) = 1 - P(\overline{E})$;

2: $P(\emptyset) = 0$;

3: if $A \subseteq B$ then $P(A) \le P(B)$;

4: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.

We also anticipate here a fifth property which will be discussed in section 3.1:

5: $P(A \cap B) = P(A|B) \cdot P(B) = P(A) \cdot P(B|A)$.
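
The five properties can be checked numerically on a small sample space. The sketch below is an illustration added here, not taken from the original notes: it assumes a regular six-sided die (all cases equally probable, so the combinatorial evaluation rule applies), and the events `A` and `B` as well as the helper names `prob` and `cond_prob` are hypothetical choices.

```python
from fractions import Fraction

OMEGA = frozenset(range(1, 7))  # outcomes of one roll of a regular die (assumption)


def prob(event):
    """Combinatorial evaluation: favourable cases over all (equally probable) cases."""
    return Fraction(len(event & OMEGA), len(OMEGA))


def cond_prob(a, b):
    """P(A|B), evaluated by restricting the possible cases to those in B."""
    return Fraction(len(a & b), len(b))


A = frozenset({2, 4, 6})  # hypothetical event "the result is even"
B = frozenset({4, 5, 6})  # hypothetical event "the result is greater than 3"

checks = {
    "property 1": prob(A) == 1 - prob(OMEGA - A),
    "property 2": prob(frozenset()) == 0,
    "property 3": prob(A & B) <= prob(A),  # (A and B) is a subset of A
    "property 4": prob(A | B) == prob(A) + prob(B) - prob(A & B),
    "property 5": prob(A & B) == cond_prob(A, B) * prob(B) == prob(A) * cond_prob(B, A),
}

for name, ok in checks.items():
    print(name, "holds" if ok else "fails")
```

Exact fractions are used instead of floating point numbers so that the equalities can be tested without rounding tolerances.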

2.4 Subjective probability and "objective" description of the physical world

The subjective definition of probability seems to contradict the aim of physicists
to describe the laws of Physics in the most objective way (whatever
this means ...). This is one of the reasons why many regard the subjective
definition of probability with suspicion (but probably the main reason is
that we have been taught at University that "probability is frequency").
The main philosophical difference between this concept of probability and an
objective definition that "we would have liked" (but which does not exist in
reality) is the fact that $P(E)$ is not an intrinsic characteristic of the event
E, but depends on the status of information available to whoever evaluates
$P(E)$. The ideal concept of "objective" probability is recovered when everybody
has the "same" status of information. But even in this case it would
be better to speak of intersubjective probability. The best way to convince
ourselves about this aspect of probability is to try to ask practical questions
and to evaluate the probability in specific cases, instead of seeking refuge in
abstract questions. I find, in fact, that, paraphrasing a famous statement
about Time, "Probability is objective as long as I am not asked to evaluate
it". Some examples:

Example 1: "What is the probability that a molecule of nitrogen at atmospheric
pressure and room temperature has a velocity between 400 and
500 m/s?". The answer appears easy: "take the Maxwell distribution
formula from a text book, calculate an integral and get a number". Now
let us change the question: "I give you a vessel containing nitrogen and
a detector capable of measuring the speed of a single molecule and you
set up the apparatus. Now, what is the probability that the first molecule
that hits the detector has a velocity between 400 and 500 m/s?". Anybody
who has minimal experience (direct or indirect) of experiments
would hesitate before answering. He would study the problem carefully 
and perform preliminary measurements and checks. Finally he would 
probably give not just a single number, but a range of possible num- 
bers compatible with the formulation of the problem. Then he starts 
the experiment and eventually, after 10 measurements, he may form a 
different opinion about the outcome of the eleventh measurement. 

Example 2: "What is the probability that the gravitational constant $G_N$
has a value between $6.6709 \cdot 10^{-11}$ and $6.6743 \cdot 10^{-11}\ \mathrm{m^3\,kg^{-1}\,s^{-2}}$?"
Last year you could have looked at the latest issue of the Particle
Data Book [8] and answered that the probability was 95 %. Since then
- as you probably know - three new measurements of $G_N$ have been
performed [9] and we now have four numbers which do not agree with
each other (see Tab. 2). The probability of the true value of $G_N$ being
in that range has now dramatically decreased.

Institute                      G_N                        sigma(G_N)/G_N   (G_N - G_N^C)/G_N^C
                               (10^-11 m^3 kg^-1 s^-2)    (ppm)            (10^-3)

CODATA 1986 ("G_N^C")          6.6726 ± 0.0009            128              -
PTB (Germany) 1994             6.7154 ± 0.0006             83              +6.41 ± 0.16
MSL (New Zealand) 1994         6.6656 ± 0.0006             95              -1.05 ± 0.16
Uni-Wuppertal (Germany) 1995   6.6685 ± 0.0007            105              -0.61 ± 0.17

Table 2: Measurement of $G_N$ (see text).

Example 3: "What is the probability that the mass of the Top quark, or
that of any of the supersymmetric particles, is below 20 or 50 GeV/c²?".
Currently it looks as if it must be zero. Ten years ago many experiments
were intensively looking for these particles in those energy ranges. Because
so many people were searching for them, with enormous human
and capital investment, it means that, at that time, the probability was
considered rather high, ... high enough for fake signals to be reported
as strong evidence for them³.

The above examples show how the evaluation of probability is conditioned 
by some a priori ("theoretical") prejudices and by some facts ("experimental 
data"). "Absolute" probability makes no sense. Even the classical example
of probability 1/2 for each of the results in tossing a coin is only acceptable if: 
the coin is regular; it does not remain vertical (not impossible when playing 
on the beach); it does not fall into a manhole; etc. 



The subjective point of view is expressed in a provocative way by de
Finetti's [5]

"PROBABILITY DOES NOT EXIST".

³We will talk later about the influence of a priori beliefs on the outcome of an experimental
investigation.

3 Conditional probability and Bayes' theorem



3.1 Dependence of the probability on the status of information

If the status of information changes, the evaluation of the probability also has
to be modified. For example most people would agree that the probability
of a car being stolen depends on the model, age and parking site. To take
an example from physics, the probability that in a HERA detector a charged
particle of 1 GeV gives a certain number of ADC counts due to the energy
loss in a gas detector can be evaluated in a very general way - using High
Energy Physics jargon - making a (huge) Monte Carlo simulation which takes
into account all possible reactions (weighted with their cross sections), all
possible backgrounds, changing all physical and detector parameters within
reasonable ranges, and also taking into account the trigger efficiency. The
probability changes if one knows that the particle is a $K^+$: instead of a very
complicated Monte Carlo one can just run a single particle generator. But
then it changes further if one also knows the exact gas mixture, pressure,
..., up to the latest determination of the pedestal and the temperature of
the ADC module.

3.2 Conditional probability 

Although everybody knows the formula of conditional probability, it is useful
to derive it here. The notation is $P(E|H)$, to be read "probability of E given
H", where H stands for hypothesis. This means: the probability that E will
occur if one already knows that H has occurred⁴.



⁴$P(E|H)$ should not be confused with $P(E \cap H)$, "the probability that both events
occur". For example $P(E \cap H)$ can be very small, but nevertheless $P(E|H)$ very high:
think of the limit case

\[
P(H) = P(H \cap H) \le P(H|H) = 1 :
\]

"H given H" is a certain event no matter how small $P(H)$ is, even if $P(H) = 0$ (in the
sense of Section 6.2).

The event E|H can have three values:

TRUE: if E is TRUE and H is TRUE;

FALSE: if E is FALSE and H is TRUE;

UNDETERMINED: if H is FALSE; in this case we are merely uninterested
as to what happens to E. In terms of betting, the bet is invalidated
and nobody loses or gains.

Then $P(E)$ can be written $P(E|\Omega)$, to state explicitly that it is the probability
of E whatever happens to the rest of the world ($\Omega$ means all possible
events). We realize immediately that this condition is really too vague and
nobody would bet a cent on such a statement. The reason for usually writing
$P(E)$ is that many conditions are implicitly - and reasonably - assumed
in most circumstances. In the classical problems of coins and dice, for example,
one assumes that they are regular. In the example of the energy loss, it
was implicit - "obvious" - that the High Voltage was on (at which voltage?)
and that HERA was running (under which condition?). But one has to take
care: many riddles are based on the fact that one tries to find a solution
which is valid under more strict conditions than those explicitly stated in
the question (e.g. many people make bad business deals signing contracts in
which what "was obvious" was not explicitly stated).

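In order to derive the formula of conditional probability let us assume for
a moment that it is reasonable to talk about "absolute probability" $P(E) = P(E|\Omega)$,
and let us rewrite

\[
\begin{aligned}
P(E) = P(E|\Omega) &\overset{(a)}{=} P(E \cap \Omega) \\
&\overset{(b)}{=} P\bigl(E \cap (H \cup \overline{H})\bigr) \\
&\overset{(c)}{=} P\bigl((E \cap H) \cup (E \cap \overline{H})\bigr) \\
&\overset{(d)}{=} P(E \cap H) + P(E \cap \overline{H})\,,
\end{aligned}
\tag{1}
\]

where the result has been achieved through the following steps:

(a): E implies $\Omega$ (i.e. $E \subseteq \Omega$) and hence $E \cap \Omega = E$;

(b): the complementary events $H$ and $\overline{H}$ make a finite partition of $\Omega$, i.e.
$H \cup \overline{H} = \Omega$;

(c): distributive property;

(d): axiom 3.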

The final result of (1) is very simple: $P(E)$ is equal to the probability that E
occurs and H also occurs, plus the probability that E occurs but H does not
occur. To obtain $P(E|H)$ we just get rid of the subset of E which does not
contain H (i.e. $E \cap \overline{H}$) and renormalize the probability dividing by $P(H)$,
assumed to be different from zero. This guarantees that if $E = H$ then
$P(H|H) = 1$. The expression of the conditional probability is finally

\[
P(E|H) = \frac{P(E \cap H)}{P(H)} \qquad (P(H) \ne 0)\,.
\tag{2}
\]

In the most general (and realistic) case, where both E and H are conditioned
by the occurrence of a third event $H_\circ$, the formula becomes

\[
P(E|H, H_\circ) = \frac{P(E \cap H\,|\,H_\circ)}{P(H|H_\circ)} \qquad (P(H|H_\circ) \ne 0)\,.
\tag{3}
\]

Usually we shall make use of (2) (which means $H_\circ = \Omega$) assuming that $\Omega$
has been properly chosen. We should also remember that (2) can be resolved
with respect to $P(E \cap H)$, obtaining the well known

\[
P(E \cap H) = P(E|H)\,P(H)\,,
\tag{4}
\]

and by symmetry

\[
P(E \cap H) = P(H|E)\,P(E)\,.
\tag{5}
\]

Two events are called independent if

\[
P(E \cap H) = P(E)\,P(H)\,.
\tag{6}
\]

This is equivalent to saying that $P(E|H) = P(E)$ and $P(H|E) = P(H)$, i.e.
the knowledge that one event has occurred does not change the probability
of the other. If $P(E|H) \ne P(E)$ then the events E and H are correlated. In
particular:

• if $P(E|H) > P(E)$ then E and H are positively correlated;

• if $P(E|H) < P(E)$ then E and H are negatively correlated.
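
Formulae (1), (2) and (6) can be tried out on a case where all the probabilities are easy to evaluate. The following sketch is an added illustration, not part of the original notes: it assumes two regular dice (36 equally probable cases), and the events and the helper names `prob`, `cond_prob` and `relation` are hypothetical.

```python
from fractions import Fraction
from itertools import product

OMEGA = set(product(range(1, 7), repeat=2))  # two regular dice (assumption)


def prob(event):
    """Probability of an event, all 36 cases being considered equally probable."""
    return Fraction(len(event), len(OMEGA))


def cond_prob(e, h):
    """Formula (2): P(E|H) = P(E and H) / P(H), defined only if P(H) is not zero."""
    if prob(h) == 0:
        raise ValueError("P(H) must be different from zero")
    return prob(e & h) / prob(h)


def relation(e, h):
    """Classify E and H by comparing P(E|H) with P(E)."""
    p_e_given_h, p_e = cond_prob(e, h), prob(e)
    if p_e_given_h == p_e:
        return "independent"
    return "positively correlated" if p_e_given_h > p_e else "negatively correlated"


sum_is_7 = {w for w in OMEGA if sum(w) == 7}
sum_ge_10 = {w for w in OMEGA if sum(w) >= 10}
first_is_6 = {w for w in OMEGA if w[0] == 6}
first_is_1 = {w for w in OMEGA if w[0] == 1}

# Formula (1): P(E) = P(E and H) + P(E and not-H)
total = prob(sum_ge_10 & first_is_6) + prob(sum_ge_10 - first_is_6)

print(relation(sum_is_7, first_is_6))
print(relation(sum_ge_10, first_is_6))
print(relation(sum_ge_10, first_is_1))
```

In this example knowing that the first die shows a 6 leaves the probability of "sum equal to 7" unchanged, raises that of "sum of at least 10", while a 1 on the first die makes the latter event impossible.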



3.3 Bayes' theorem

Let us think of all the possible, mutually exclusive, hypotheses $H_i$ which
could condition the event E. The problem here is the inverse of the previous
one: what is the probability of $H_i$ under the hypothesis that E has occurred?
For example, "what is the probability that a charged particle which went in
a certain direction and has lost between 100 and 120 keV in the detector, is a
μ, a π, a K or a p?" Our event E is "energy loss between 100 and 120 keV",
and $H_i$ are the four "particle hypotheses". This example sketches the basic
problem for any kind of measurement: having observed an effect, to assess
the probability of each of the causes which could have produced it. This
intellectual process is called inference, and it will be discussed after section 9.

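In order to calculate $P(H_i|E)$ let us rewrite the joint probability $P(H_i \cap E)$,
making use of (4-5), in two different ways:

\[
P(H_i|E)\,P(E) = P(E|H_i)\,P(H_i)\,,
\tag{7}
\]

obtaining

\[
P(H_i|E) = \frac{P(E|H_i)\,P(H_i)}{P(E)}
\tag{8}
\]

or

\[
\frac{P(H_i|E)}{P(H_i)} = \frac{P(E|H_i)}{P(E)}\,.
\tag{9}
\]

Since the hypotheses $H_i$ are mutually exclusive (i.e. $H_i \cap H_j = \emptyset$, $\forall\, i \ne j$) and
exhaustive (i.e. $\bigcup_i H_i = \Omega$), E can be written as $\bigcup_i (E \cap H_i)$, the union of the
intersections of E with each of the hypotheses $H_i$. It follows that

\[
\begin{aligned}
P(E) = P(E \cap \Omega) &= P\Bigl(E \cap \bigcup_i H_i\Bigr) = P\Bigl(\bigcup_i (E \cap H_i)\Bigr) \\
&= \sum_i P(E \cap H_i) \\
&= \sum_i P(E|H_i)\,P(H_i)\,,
\end{aligned}
\tag{10}
\]

where we have made use of (4) again in the last step. It is then possible to
rewrite (8) as

\[
P(H_i|E) = \frac{P(E|H_i)\,P(H_i)}{\sum_j P(E|H_j)\,P(H_j)}\,.
\tag{11}
\]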

This is the standard form by which Bayes' theorem is known. (8) and (9) are
also different ways of writing it. As the denominator of (11) is nothing but
a normalization factor (such that $\sum_i P(H_i|E) = 1$), the formula (11) can be
written as

\[
P(H_i|E) \propto P(E|H_i)\,P(H_i)\,.
\tag{12}
\]

Factorizing $P(H_i)$ in (11), and explicitly writing the fact that all the events
were already conditioned by $H_\circ$, we can rewrite the formula as

\[
P(H_i|E, H_\circ) = \alpha\, P(H_i|H_\circ)\,,
\tag{13}
\]

with

\[
\alpha = \frac{P(E|H_i, H_\circ)}{\sum_j P(E|H_j, H_\circ)\,P(H_j|H_\circ)}\,.
\tag{14}
\]



These five ways of rewriting the same formula simply reflect the importance
that we shall give to this simple theorem. They stress different aspects of
the same concept:

• (11) is the standard way of writing it (although some prefer (8));

• (9) indicates that $P(H_i)$ is altered by the condition E with the same
ratio with which $P(E)$ is altered by the condition $H_i$;

• (12) is the simplest and the most intuitive way to formulate the theorem:
"the probability of $H_i$ given E is proportional to the initial
probability of $H_i$ times the probability of E given $H_i$";

• (13-14) show explicitly how the probability of a certain hypothesis is
updated when the status of information changes:

$P(H_i|H_\circ)$ (also indicated as $P_\circ(H_i)$) is the initial, or a priori, probability
(or simply "prior") of $H_i$, i.e. the probability of this hypothesis
with the status of information available before the knowledge
that E has occurred;

$P(H_i|E, H_\circ)$ (or simply $P(H_i|E)$) is the final, or "a posteriori", probability
of $H_i$ after the new information;

$P(E|H_i, H_\circ)$ (or simply $P(E|H_i)$) is called likelihood.
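
Formulae (10)-(12) translate directly into a few lines of code. The sketch below is an added illustration, not part of the original notes: the function name `posterior` is an assumption, and the priors and likelihoods assigned to the four particle hypotheses are purely hypothetical numbers chosen to show the mechanics of the update.

```python
def posterior(priors, likelihoods):
    """Apply formula (11) to mutually exclusive and exhaustive hypotheses.

    priors      -- dict: hypothesis -> initial probability P(H_i)
    likelihoods -- dict: hypothesis -> likelihood P(E|H_i)
    """
    unnormalized = {h: likelihoods[h] * priors[h] for h in priors}  # formula (12)
    p_e = sum(unnormalized.values())                                # formula (10)
    if p_e == 0:
        raise ValueError("E is impossible under every hypothesis")
    return {h: value / p_e for h, value in unnormalized.items()}


# Purely hypothetical numbers for the four particle hypotheses of the text.
priors = {"mu": 0.10, "pi": 0.60, "K": 0.20, "p": 0.10}
likelihoods = {"mu": 0.05, "pi": 0.10, "K": 0.30, "p": 0.40}

final = posterior(priors, likelihoods)
for h in priors:
    print(f"P({h}|E) = {final[h]:.3f}   (prior: {priors[h]:.2f})")
```

With these made-up inputs the hypotheses with a likelihood above the average gain probability at the expense of the others, and the final probabilities still add up to one.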



To better understand the terms "initial", "final" and "likelihood", let us
formulate the problem in a way closer to the physicist's mentality, referring
to causes and effects: the causes can be all the physical sources which may
produce a certain observable (the effect). The likelihoods are - as the word
says - the likelihoods that the effect follows from each of the causes. Using
our example of the dE/dx measurement again, the causes are all the possible
charged particles which can pass through the detector; the effect is the
amount of observed ionization; the likelihoods are the probabilities that each
of the particles gives that amount of ionization. Notice that in this example
we have fixed all the other sources of influence: physics process, HERA running
conditions, gas mixture, High Voltage, track direction, etc. This is our
$H_\circ$. The problem immediately gets rather complicated (all real cases, apart
from tossing coins and dice, are complicated!). The real inference would be
of the kind

\[
P(H_i|E, H_\circ) \propto P(E|H_i, H_\circ)\,P(H_i|H_\circ)\,P(H_\circ)\,.
\tag{15}
\]

For each status of $H_\circ$ (the set of all the possible values of the influence
parameters) one gets a different result for the final probability⁵. So, instead
of getting a single number for the final probability we have a distribution of
values. This spread will result in a large uncertainty of $P(H_i|E)$. This is
what every physicist knows: if the calibration constants of the detector and
the physics process are not under control, the "systematic errors" are large
and the result is of poor quality.
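
The dependence of the final probability on the status of $H_\circ$ can be made concrete with a toy calculation. Everything in the sketch below is an assumption introduced for illustration and does not come from the original notes: two hypotheses only, three invented "calibration settings" playing the role of $H_\circ$, invented likelihoods for each of them, and priors taken to be the same for every setting.

```python
def posterior(priors, likelihoods):
    """Bayes' theorem, formula (11), for mutually exclusive and exhaustive hypotheses."""
    unnormalized = {h: likelihoods[h] * priors[h] for h in priors}
    norm = sum(unnormalized.values())
    return {h: value / norm for h, value in unnormalized.items()}


# Hypothetical priors P(H_i|H_0), assumed not to depend on the setting.
priors = {"pi": 0.90, "mu": 0.10}

# Hypothetical likelihoods P(E|H_i, H_0) for three invented statuses of H_0.
likelihoods_by_setting = {
    "nominal calibration": {"pi": 0.02, "mu": 0.95},
    "gain too low": {"pi": 0.05, "mu": 0.90},
    "gain too high": {"pi": 0.01, "mu": 0.97},
}

p_mu = {
    setting: posterior(priors, likelihoods)["mu"]
    for setting, likelihoods in likelihoods_by_setting.items()
}
spread = max(p_mu.values()) - min(p_mu.values())

for setting, value in p_mu.items():
    print(f"{setting:20s} P(mu|E, H_0) = {value:.3f}")
print(f"spread of the final probability: {spread:.3f}")
```

The less the influence parameters are under control, the wider the range of settings that has to be considered, and the larger the resulting spread of the final probability.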

3.4 Conventional use of Bayes' theorem 

Bayes' theorem follows directly from the rules of probability, and it can be 
used in any kind of approach. Let us take an example: 

Problem 1: A particle detector has a ii identification efficiency of 95 %, and 
a probability of identifying a tt as a of 2 %. If a particle is identified 
as a then a trigger is issued. Knowing that the particle beam is a 
mixture of 90 % tt and 10 % /i, what is the probability that a trigger is 
really fired by a /x? What is the signal-to-noise {S/N) ratio? 

Solution: The two hypotheses (causes) which could condition the event (ef- 
fect) T (= "trigger fired") are "//" and "tt". They are incompatible 

^Thc symbol oc could be misunderstood if one forgets that the proportionality factor 
depends on all likelihoods and priors (see (13)). This means that, for a given hypothesis 
Hi, as the status of information E changes, P{Hi\E, Ho) may change if P{E\Hi,Ho) 
and P{Hi\Ho) remain constant, if some of the other likelihoods get modified by the new 
information. 
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(clearly) and exhaustive (90%+10%=100%). Then: 



"^^^''^ P{T\^)P.{^) + P{T\n)P.{^) ^^'^ 
0.95 X 0.1 

= 0.84 , 



0.95 X 0.1 + 0.02 X 0.9 

and P(7r|T) = 0.16. 

The signal to noise ratio is $P(\mu|T)/P(\pi|T) = 5.3$. It is interesting to
rewrite the general expression of the signal to noise ratio if the effect
$E$ is observed as

$$ S/N = \frac{P(S|E)}{P(N|E)} = \frac{P(E|S)}{P(E|N)} \cdot \frac{P_\circ(S)}{P_\circ(N)} . \tag{17} $$

This formula explicitly shows that when there are noisy conditions,

$$ P_\circ(S) < P_\circ(N) , $$

the experiment must be very selective,

$$ P(E|S) > P(E|N) , $$

in order to have a decent $S/N$ ratio.

(How does the $S/N$ change if the particle has to be identified by two
independent detectors in order to give the trigger? Try it yourself, the
answer is $S/N = 251$.)
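The numbers of Problem 1 can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `posterior` and the dictionary keys are choices made here, not notation from the notes. The two-detector case assumes, as the exercise states, that the detectors are independent, so that the likelihoods multiply.

```python
def posterior(likelihoods, priors):
    """Bayes' theorem for a finite set of exclusive and exhaustive hypotheses."""
    joint = {h: likelihoods[h] * priors[h] for h in priors}
    norm = sum(joint.values())
    return {h: joint[h] / norm for h in joint}


# Beam composition (priors) and probability that each particle fires the trigger.
priors = {"mu": 0.10, "pi": 0.90}
p_trigger = {"mu": 0.95, "pi": 0.02}

one = posterior(p_trigger, priors)
s_over_n_one = one["mu"] / one["pi"]

# Two independent detectors: both must fire, so each likelihood is squared.
p_trigger_two = {h: p ** 2 for h, p in p_trigger.items()}
two = posterior(p_trigger_two, priors)
s_over_n_two = two["mu"] / two["pi"]

print(f"P(mu|T) = {one['mu']:.2f}, S/N = {s_over_n_one:.1f}")
print(f"two detectors: S/N = {s_over_n_two:.0f}")
```

The signal-to-noise ratio could equally be computed without normalizing, as the product of the likelihood ratio and the prior ratio in (17).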

Problem 2: Three boxes contain two rings each, but in one of them they 
are both gold, in the second both silver, and in the third one of each 
type. You have the choice of randomly extracting a ring from one of 
the boxes, the content of which is unknown to you. You look at the 
selected ring, and you then have the possibility of extracting a second 
ring, again from any of the three boxes. Let us assume the first ring 
you extract is a gold one. Is it then preferable to extract the second 
one from the same or from a different box? 

Solution: Choosing the same box you have a 2/3 probability of getting a 
second gold ring. (Try to apply the theorem, or help yourself with 
intuition.) 
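For readers who prefer to apply the theorem rather than rely on intuition, the following sketch enumerates the three boxes with exact fractions. The box labels are hypothetical names introduced here. The "different box" case additionally assumes that the second box is picked at random among the other two, which the problem statement does not specify.

```python
from fractions import Fraction

# Contents of the three boxes (labels are illustrative).
boxes = {"GG": ("gold", "gold"), "SS": ("silver", "silver"), "GS": ("gold", "silver")}
prior = Fraction(1, 3)  # each box equally likely to be the one chosen first

# Likelihood of drawing a gold ring first from each box.
p_gold = {name: Fraction(rings.count("gold"), 2) for name, rings in boxes.items()}

# Bayes' theorem: probability of each box given that the first ring was gold.
evidence = sum(prior * p for p in p_gold.values())
post = {name: prior * p / evidence for name, p in p_gold.items()}

# Same box: the remaining ring is gold only if the box is GG.
p_same = post["GG"]

# Different box, chosen at random among the other two (an assumption made here).
p_diff = sum(
    post[first] * Fraction(1, 2) * p_gold[other]
    for first in boxes
    for other in boxes
    if other != first
)

print(p_same, p_diff)
```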

The difference between the two problems, from the conventional statistics
point of view, is that the first is only meaningful in the frequentistic
approach, the second only in the combinatorial one. They are, however, both
acceptable from the Bayesian point of view. This is simply because in this
framework there is no restriction on the definition of probability. In many
and important cases of life and science, neither of the two conventional
definitions is applicable.

3.5 Bayesian statistics: learning by experience 

The advantage of the Bayesian approach (leaving aside the "little philosoph- 
ical detail" of trying to define what probability is) is that one may talk about 
the probability of any kind of event, as already emphasized. Moreover, the 
procedure of updating the probability with increasing information is very 
similar to that followed by the mental processes of rational people. Let us 
consider a few examples of "Bayesian use" of Bayes' theorem: 

Example 1: Imagine some persons listening to a common friend having a
phone conversation with an unknown person $X_i$, and who are trying
to guess who $X_i$ is. Depending on the knowledge they have about the
friend, on the language spoken, on the tone of voice, on the subject of
conversation, etc., they will attribute some probability to several
possible persons. As the conversation goes on they begin to consider some
possible candidates for $X_i$, discarding others, and eventually fluctuating
between two possibilities, until the status of information $I$ is such
that they are practically sure of the identity of $X_i$. This experience has
happened to most of us, and it is not difficult to recognize the Bayesian
scheme:

$$ P(X_i|I, I_\circ) \propto P(I|X_i, I_\circ)\,P(X_i|I_\circ) . \tag{18} $$

We have put the initial status of information $I_\circ$ explicitly in (18) to
remind us that likelihoods and initial probabilities depend on it. If
we know nothing about the person, the final probabilities will be very
vague, i.e. for many persons $X_i$ the probability will be different from
zero, without necessarily favoring any particular person.

Example 2: A person X meets an old friend F in a pub. F proposes that
the drinks should be paid for by whichever of the two extracts the
card of lower value from a pack (according to some rule which is of no
interest to us). X accepts and F wins. This situation happens again
in the following days and it is always X who has to pay. What is the
probability that F has become a cheat, as the number of consecutive
wins $n$ increases?

The two hypotheses are: cheat ($C$) and honest ($H$). $P_\circ(C)$ is low
because F is an "old friend", but certainly not zero (you know . . .): let
us assume 5 %. To make the problem simpler let us make the approximation
that a cheat always wins (not very clever. . . ): $P(W_n|C) = 1$.
The probability of winning if he is honest is, instead, given by the rules
of probability assuming that the chance of winning at each trial is 1/2
("why not?", we shall come back to this point later): $P(W_n|H) = 2^{-n}$.
The result

$$ P(C|W_n) = \frac{P(W_n|C) \cdot P_\circ(C)}{P(W_n|C) \cdot P_\circ(C) + P(W_n|H) \cdot P_\circ(H)}
            = \frac{1 \cdot P_\circ(C)}{1 \cdot P_\circ(C) + 2^{-n} \cdot P_\circ(H)} \tag{19} $$

is shown in the following table:

  n    P(C|W_n) (%)    P(H|W_n) (%)
  0         5.0            95.0
  1         9.5            90.5
  2        17.4            82.6
  3        29.6            70.4
  4        45.7            54.3
  5        62.7            37.3
  6        77.1            22.9


Naturally, as F continues to win the suspicion of X increases. It is
important to make two remarks:

• the answer is always probabilistic. X can never reach absolute
certainty that F is a cheat, unless he catches F cheating, or F
confesses to having cheated. This is coherent with the fact that
we are dealing with random events and with the fact that any
sequence of outcomes has the same probability (although there is
only one possibility over $2^n$ in which F is always luckier). Making
use of $P(C|W_n)$, X can take a decision about the next action to
take:

- continue the game, with probability $P(C|W_n)$ of losing, with
certainty, the next time too;
- refuse to play further, with probability $P(H|W_n)$ of offending
the innocent friend.

• If $P_\circ(C) = 0$ the final probability will always remain zero: if X
fully trusts F, then he has just to record the occurrence of a rare
event when $n$ becomes large.

To better follow the process of updating the probability when new
experimental data become available, according to the Bayesian scheme

"the final probability of the present inference is the initial
probability of the next one",

let us call $P(C|W_{n-1})$ the probability assigned after the previous win.
The iterative application of the Bayes formula yields:

$$ P(C|W_n) = \frac{P(W|C) \cdot P(C|W_{n-1})}{P(W|C) \cdot P(C|W_{n-1}) + P(W|H) \cdot P(H|W_{n-1})}
            = \frac{1 \cdot P(C|W_{n-1})}{1 \cdot P(C|W_{n-1}) + \frac{1}{2} \cdot P(H|W_{n-1})} , \tag{20} $$

where $P(W|C) = 1$ and $P(W|H) = 1/2$ are the probabilities of each
win. The interesting result is that exactly the same values of $P(C|W_n)$
of (19) are obtained (try to believe it!).
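Those who would rather check than believe can compare the two formulas numerically. The sketch below is illustrative only (function and variable names are chosen here): it applies the one-step update of (20) six times, starting from the 5 % prior, and compares each value with the closed form (19).

```python
def closed_form(n, prior_cheat):
    """P(C|W_n) from (19): a cheat always wins, an honest player wins with 2**-n."""
    return prior_cheat / (prior_cheat + 2.0 ** (-n) * (1.0 - prior_cheat))


def update_once(p_cheat, p_win_cheat=1.0, p_win_honest=0.5):
    """One step of (20): the previous final probability is used as the new prior."""
    numerator = p_win_cheat * p_cheat
    return numerator / (numerator + p_win_honest * (1.0 - p_cheat))


p = 0.05
history = [p]
for n in range(1, 7):
    p = update_once(p)
    history.append(p)

for n, value in enumerate(history):
    print(f"n = {n}: iterative {100 * value:5.1f} %   closed form {100 * closed_form(n, 0.05):5.1f} %")
```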

It is also instructive to see the dependence of the final probability on the
initial probabilities, for a given number of wins $n$:

                        P(C|W_n) (%)
  P_o(C)     n = 5     n = 10     n = 15     n = 20
   1 %        24        91         99.7       99.99
   5 %        63        98         99.94      99.998
  50 %        97        99.90      99.997     99.9999


As the number of experimental observations increases the conclusions no 
longer depend, practically, on the initial assumptions. This is a crucial point 
in the Bayesian scheme and it will be discussed in more detail later. 
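The weakening dependence on the prior can be made quantitative with formula (19). In the sketch below (names are illustrative) the spread between the final probabilities obtained from three very different priors is computed for increasing $n$. It shrinks rapidly, which is the point made above.

```python
def p_cheat(n, prior):
    """Final probability of 'cheat' after n consecutive wins, formula (19)."""
    return prior / (prior + 2.0 ** (-n) * (1.0 - prior))


priors = (0.01, 0.05, 0.50)
wins = (5, 10, 15, 20)

# Spread (max - min) of the final probability across the priors, for each n.
spread = {n: max(p_cheat(n, p) for p in priors) - min(p_cheat(n, p) for p in priors)
          for n in wins}

for n in wins:
    row = ", ".join(f"{100 * p_cheat(n, p):.4f} %" for p in priors)
    print(f"n = {n:2d}: {row}   (spread {spread[n]:.2e})")
```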
4 Hypothesis test (discrete case)

Although in conventional statistics books this topic is usually dealt with
in one of the later chapters, in the Bayesian approach this is so natural that
it is in fact the first application, as we have seen in the above examples. We
summarize here the procedure:

• probabilities are attributed to the different hypotheses using initial prob- 
abilities and experimental data (via the likelihood); 

• the person who makes the inference - or the "user" - will take a decision 
of which he is fully responsible. 

If one needs to compare two hypotheses, as in the example of the signal
to noise calculation, the ratio of the final probabilities can be taken as a
quantitative result of the test. Let us rewrite the $S/N$ formula in the most
general case:

$$ \frac{P(H_1|E, H_\circ)}{P(H_2|E, H_\circ)}
   = \frac{P(E|H_1, H_\circ)}{P(E|H_2, H_\circ)} \cdot \frac{P(H_1|H_\circ)}{P(H_2|H_\circ)} , \tag{21} $$

where again we have reminded ourselves of the existence of $H_\circ$. The ratio
depends on the product of two terms: the ratio of the priors and the ratio of
the likelihoods. When there is absolutely no reason for choosing between the
two hypotheses the prior ratio is 1 and the decision depends only on the other
term, called the Bayes factor. If one firmly believes in either hypothesis, the
Bayes factor is of minor importance, unless it is zero or infinite (i.e. one and
only one of the likelihoods is vanishing). Perhaps this is disappointing for
those who expected objective certainties from a probability theory, but this
is in the nature of things.

5 Choice of the initial probabilities (discrete case)

5.1 General criteria

The dependence of Bayesian inferences on initial probability is pointed to by
opponents as the fatal flaw in the theory. But this criticism is less severe
than one might think at first sight. In fact:

• It is impossible to construct a theory of uncertainty which is not
affected by this "illness". Those methods which are advertised as being
"objective" tend in reality to hide the hypotheses on which they are
grounded. A typical example is the maximum likelihood method, of
which we will talk later.

• as the amount of information increases the dependence on initial prej- 
udices diminishes; 

• when the amount of information is very limited, or completely lacking, 
there is nothing to be ashamed of if the inference is dominated by a 
priori assumptions; 

The fact that conclusions drawn from an experimental result (and sometimes
even the "result" itself!) often depend on prejudices about the phenomenon
under study is well known to all experienced physicists. Some examples:

• when doing quick checks on a device, a single measurement is usually 
performed if the value is "what it should be" , but if it is not then many 
measurements tend to be made; 

• results are sometimes influenced by previous results or by theoreti- 
cal predictions. See for example Fig. 3 taken from the Particle Data 
Book[8]. The interesting book "How experiments end"[10] discusses, 
among others, the issue of when experimentalists are "happy with the 
result" and stop "correcting for the systematics" ; 

• it can happen that slight deviations from the background are inter- 
preted as a signal (e.g. as for the first claim of discovery of the Top 
quark in spring '94), while larger "signals" are viewed with suspicion if 
they are unwanted by the physics "establishment"^; 

• experiments are planned and financed according to the prejudices of 
the moment^; 

These comments are not intended to justify unscrupulous behaviour or sloppy 
analysis. They are intended, instead, to remind us - if need be - that scien- 
tific research is ruled by subjectivity much more than outsiders imagine. The 
transition from subjectivity to "objectivity" begins when there is a large con- 
sensus among the most influential people about how to interpret the results^. 

In this context, the subjective approach to statistical inference at least 
teaches us that every assumption must be stated clearly and all available in- 
formation which could influence conclusions must be weighed with the max- 
imum attempt at objectivity. 

case, concerning the search for electron compositeness in e+e- collisions, is discussed
Figure 3: Results on two physical quantities as a function of the publication date.

Figure 4: $R = \sigma_L/\sigma_T$ as a function of the Deep Inelastic Scattering variable $x$
as measured by experiments and as predicted by QCD.

What are the rules for choosing the "right" initial probabilities? As one
can imagine, this is an open and debated question among scientists and
philosophers. My personal point of view is that one should avoid pedantic
discussion of the matter, because the idea of universally true priors reminds
me terribly of the famous "angels' sex" debates.

If I had to give recommendations, they would be: 

• the a priori probability should be chosen in the same spirit as the 
rational person who places a bet, seeking to minimize the risk of losing; 

• general principles - like those that we will discuss in a while - may 
help, but since it may be difficult to apply elegant theoretical ideas in 
all practical situations, in many circumstances the guess of the "expert" 
can be relied on for guidance. 

• avoid using as prior the results of other experiments dealing with the 
same open problem, otherwise correlations between the results would 
prevent all comparison between the experiments and thus the detection 
of any systematic errors. I find that this point is generally overlooked 
by statisticians. 

5.2 Insufficient Reason and Maximum Entropy 

The first and most famous criterion for choosing initial probabilities is the
simple Principle of Insufficient Reason (or Indifference Principle): if there
is no reason to prefer one hypothesis over alternatives, simply attribute the
same probability to all of them. This was stated as a principle by Laplace^
in contrast to Leibnitz' famous Principle of Sufficient Reason, which, in
simple words, states that "nothing happens without a reason". The indifference
principle applied to coin and die tossing, to card games or to other simple and
symmetric problems, yields the well-known rule of probability evaluation
that we have called combinatorial. Since it is impossible not to agree with

in [11]. 

^For a recent delightful report, see [12]. 

* "A theory needs to be confirmed by experiments. But it is also true that an experi- 
mental result needs to be confirmed by a theory". This sentence expresses clearly - though 
paradoxically - the idea that it is difficult to accept a result which is not rationally jus- 
tified. An example of results "not confirmed by the theory" are the R measurements in 
Deep Inelastic Scattering shown in Fig. 4. Given the conflict in this situation, physicists 
tend to believe more in QCD and use the "low-x" extrapolations ( of what? ) to correct the 
data for the unknown values of R. 

^It may help in understanding Laplace's approach if we consider that he called the 
theory of probability "good sense turned into calculation" . 
this point of view, in the cases that one judges that it does apply, the
combinatorial "definition" of probability is recovered in the Bayesian approach
if the word "definition" is simply replaced by "evaluation rule". We have in
fact already used this reasoning in previous examples.

A modern and more sophisticated version of the Indifference Principle
is the Maximum Entropy Principle. The information entropy function of $n$
mutually exclusive events, to each of which a probability $p_i$ is assigned, is
defined as

$$ H(p_1, p_2, \ldots, p_n) = -K \sum_{i=1}^{n} p_i \ln p_i , \tag{22} $$

with $K$ a positive constant. The principle states that "in making inferences
on the basis of partial information we must use that probability distribution
which has the maximum entropy subject to whatever is known" [13].

Notice that, in this case, "entropy" is synonymous with "uncertainty" [13].
One can show that, in the case of absolute ignorance about the events $E_i$,
the maximization of the information uncertainty, with the constraint that
$\sum_{i=1}^{n} p_i = 1$, yields the classical $p_i = 1/n$ (any other result would have been
worrying. . . ).

Although this principle is sometimes used in combination with the Bayes'
formula for inferences (also applied to measurement uncertainty, see [14]), it
will not be used for applications in these notes.
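The statement that absolute ignorance leads to $p_i = 1/n$ can be verified in a few lines. The following is a sketch of the standard argument, assuming the entropy has the form $-K \sum_i p_i \ln p_i$ and introducing a Lagrange multiplier $\lambda$ (a symbol not used in these notes) for the normalization constraint:

```latex
\begin{align}
\mathcal{L}(p_1,\ldots,p_n,\lambda)
  &= -K \sum_{i=1}^{n} p_i \ln p_i
     + \lambda \Big( \sum_{i=1}^{n} p_i - 1 \Big) , \\
\frac{\partial \mathcal{L}}{\partial p_i}
  &= -K \,(\ln p_i + 1) + \lambda = 0
  \;\Longrightarrow\;
  p_i = e^{\lambda/K - 1} \quad \text{(the same value for every } i\text{)} , \\
\sum_{i=1}^{n} p_i = 1
  &\;\Longrightarrow\; p_i = \frac{1}{n} .
\end{align}
```

Since $\partial^2 \mathcal{L}/\partial p_i^2 = -K/p_i < 0$, the stationary point is indeed a maximum.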

6 Random variables 

In the discussion which follows I will assume that the reader is familiar with 
random variables, distributions, probability density functions, expected val- 
ues, as well as with the most frequently used distributions. This section is 
only intended as a summary of concepts and as a presentation of the notation 
used in the subsequent sections. 

6.1 Discrete variables 

Stated simply, to define a random variable $X$ means to find a rule which
allows a real number to be related univocally (but not biunivocally) to an
event ($E$), chosen from those events which constitute a finite partition of
$\Omega$ (i.e. the events must be exhaustive and mutually exclusive). One could
write this expression $X(E)$. If the number of possible events is finite then
the random variable is discrete, i.e. it can assume only a finite number of
values. Since the chosen set of events are mutually exclusive, the probability
of $X = x$ is the sum of the probabilities of all the events for which $X(E_i) = x$.
Note that we shall indicate the variable with $X$ and its numerical realization
with $x$, and that, differently from other notations, the symbol $x$ (in place of
$n$ or $k$) is also used for discrete variables.

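After this short introduction, here is a list of definitions, properties and
notations:

Probability function:

$$ f(x) = P(X = x) . \tag{23} $$

It has the following properties:

$$ 1)\quad 0 \le f(x_i) \le 1 ; \tag{24} $$
$$ 2)\quad P(X = x_i \cup X = x_j) = f(x_i) + f(x_j) ; \tag{25} $$
$$ 3)\quad \sum_i f(x_i) = 1 . \tag{26} $$

Cumulative distribution function:

$$ F(x_k) \equiv P(X \le x_k) = \sum_{x_i \le x_k} f(x_i) . \tag{27} $$

Properties:

$$ 1)\quad F(-\infty) = 0 \tag{28} $$
$$ 2)\quad F(+\infty) = 1 \tag{29} $$
$$ 3)\quad F(x_i) - F(x_{i-1}) = f(x_i) \tag{30} $$
$$ 4)\quad \lim_{\epsilon \to 0} F(x + \epsilon) = F(x) \quad \text{(right side continuity)} . \tag{31} $$

Expected value (mean):

$$ \mu \equiv E[X] = \sum_i x_i f(x_i) . \tag{32} $$

In general, given a function $g(X)$ of $X$:

$$ E[g(X)] = \sum_i g(x_i) f(x_i) . \tag{33} $$

$E[\cdot]$ is a linear operator:

$$ E[aX + b] = a\,E[X] + b . \tag{34} $$

Variance and standard deviation: Variance:

$$ \sigma^2 \equiv Var(X) = E[(X - \mu)^2] = E[X^2] - \mu^2 . \tag{35} $$

Standard deviation:

$$ \sigma = +\sqrt{\sigma^2} . \tag{36} $$

Transformation properties:

$$ Var(aX + b) = a^2\,Var(X) ; \tag{37} $$
$$ \sigma(aX + b) = |a|\,\sigma(X) . \tag{38} $$

Binomial distribution: $X \sim \mathcal{B}_{n,p}$ (hereafter "$\sim$" stands for "follows");
$\mathcal{B}_{n,p}$ stands for binomial with parameters $n$ (integer) and $p$ (real):

$$ f(x|\mathcal{B}_{n,p}) = \frac{n!}{(n-x)!\,x!}\, p^x (1-p)^{n-x} , \qquad
   n = 1, 2, \ldots, \infty ; \quad 0 \le p \le 1 ; \quad x = 0, 1, \ldots, n . \tag{39} $$

Expected value, standard deviation and variation coefficient:

$$ \mu = np \tag{40} $$
$$ \sigma = \sqrt{np(1-p)} \tag{41} $$
$$ v \equiv \frac{\sigma}{\mu} = \frac{\sqrt{np(1-p)}}{np} \propto \frac{1}{\sqrt{n}} . \tag{42} $$

$1 - p$ is often indicated by $q$.

Poisson distribution: $X \sim \mathcal{P}_\lambda$:

$$ f(x|\mathcal{P}_\lambda) = \frac{\lambda^x}{x!}\, e^{-\lambda} , \qquad
   0 < \lambda < \infty ; \quad x = 0, 1, \ldots, \infty . \tag{43} $$

($x$ is integer, $\lambda$ is real.)

Expected value, standard deviation and variation coefficient:

$$ \mu = \lambda \tag{44} $$
$$ \sigma = \sqrt{\lambda} \tag{45} $$
$$ v = \frac{1}{\sqrt{\lambda}} . \tag{46} $$

Binomial $\to$ Poisson: for $n \to$ "$\infty$" and $p \to$ "0", with $\lambda = np$,

$$ \mathcal{B}_{n,p} \longrightarrow \mathcal{P}_\lambda . \tag{47} $$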
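The limit (47) is easy to watch numerically. The sketch below is an illustration only, with function names chosen here: it keeps $\lambda = np$ fixed, increases $n$, and reports the largest pointwise difference between the binomial and the Poisson probability functions (39) and (43).

```python
import math


def binomial_pmf(x, n, p):
    """Probability function (39)."""
    return math.comb(n, x) * p ** x * (1.0 - p) ** (n - x)


def poisson_pmf(x, lam):
    """Probability function (43)."""
    return lam ** x * math.exp(-lam) / math.factorial(x)


def max_abs_diff(n, lam, x_max=20):
    """Largest |binomial - Poisson| over x = 0..min(n, x_max), with p = lam / n."""
    p = lam / n
    return max(abs(binomial_pmf(x, n, p) - poisson_pmf(x, lam))
               for x in range(min(n, x_max) + 1))


for n in (10, 100, 1000, 10000):
    print(f"n = {n:5d}, p = {2.0 / n:.4f}: max difference = {max_abs_diff(n, 2.0):.2e}")
```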
6.2 Continuous variables: probability and density function

Moving from discrete to continuous variables there are the usual problems
with infinite possibilities, similar to those found in Zeno's "Achilles and the
tortoise" paradox. In both cases the answer is given by infinitesimal calculus.
But some comments are needed:

• the probability of each of the realizations of $X$ is zero ($P(X = x) = 0$),
but this does not mean that each value is impossible, otherwise it would
be impossible to get any result;

• although all values $x$ have zero probability, one usually assigns different
degrees of belief to them, quantified by the probability density
function $f(x)$. Writing $f(x_1) > f(x_2)$, for example, indicates that our
degree of belief in $x_1$ is greater than that in $x_2$.

• The probability that a random variable lies inside a finite interval, for
example $P(a \le X \le b)$, is instead finite. If the distance between $a$
and $b$ becomes infinitesimal, then the probability becomes infinitesimal
too. If all the values of $X$ have the same degree of belief (and not only
equal numerical probability $P(x) = 0$) the infinitesimal probability is
simply proportional to the infinitesimal interval, $dP = k\,dx$. In the
general case the ratio between two infinitesimal probabilities around
two different points will be equal to the ratio of the degrees of belief in
the points (this argument implies the continuity of $f(x)$ on either side
of the values). It follows that $dP = f(x)\,dx$ and then

$$ P(a \le X \le b) = \int_a^b f(x)\,dx . \tag{48} $$

• $f(x)$ has a dimension inverse to that of the random variable.

After this short introduction, here is a list of definitions, properties and
notations:

Cumulative distribution function:

$$ F(x) = P(X \le x) = \int_{-\infty}^{x} f(x')\,dx' , \tag{49} $$

or

$$ f(x) = \frac{dF(x)}{dx} . \tag{50} $$




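Properties of $f(x)$ and $F(x)$:

• $f(x) \ge 0$ ;

• $\int_{-\infty}^{+\infty} f(x)\,dx = 1$ ;

• $0 \le F(x) \le 1$ ;

• $P(a \le X \le b) = \int_a^b f(x)\,dx = \int_{-\infty}^{b} f(x)\,dx - \int_{-\infty}^{a} f(x)\,dx = F(b) - F(a)$ ;

• if $x_2 > x_1$ then $F(x_2) \ge F(x_1)$ ;

• $\lim_{x \to -\infty} F(x) = 0$ ; $\lim_{x \to +\infty} F(x) = 1$ .

Expected value:

$$ E[X] = \int_{-\infty}^{+\infty} x\,f(x)\,dx \tag{51} $$
$$ E[g(X)] = \int_{-\infty}^{+\infty} g(x)\,f(x)\,dx . \tag{52} $$

Uniform distribution:^10 $X \sim \mathcal{K}(a, b)$:

$$ f(x|\mathcal{K}(a,b)) = \frac{1}{b-a} \qquad (a \le x \le b) \tag{53} $$
$$ F(x|\mathcal{K}(a,b)) = \frac{x-a}{b-a} . \tag{54} $$

Expected value and standard deviation:

$$ \mu = \frac{a+b}{2} \tag{55} $$
$$ \sigma = \frac{b-a}{\sqrt{12}} . \tag{56} $$

Normal (gaussian) distribution: $X \sim \mathcal{N}(\mu, \sigma)$:

$$ f(x|\mathcal{N}(\mu,\sigma)) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} , \qquad
   -\infty < \mu < +\infty ; \quad 0 < \sigma < \infty ; \quad -\infty < x < +\infty , \tag{57} $$

where $\mu$ and $\sigma$ (both real) are the expected value and standard
deviation, respectively.

^10 The symbols of the following distributions have the parameters within parentheses to
indicate that the variables are continuous.

Standard normal distribution: the particular normal distribution of mean 0
and standard deviation 1, usually indicated by $Z$:

$$ Z \sim \mathcal{N}(0, 1) . \tag{58} $$

Exponential distribution: $T \sim \mathcal{E}(\tau)$:

$$ f(t|\mathcal{E}(\tau)) = \frac{1}{\tau}\, e^{-t/\tau} , \qquad 0 \le t < \infty \tag{59} $$
$$ F(t|\mathcal{E}(\tau)) = 1 - e^{-t/\tau} ; \tag{60} $$

we use the symbol $t$ instead of $x$ because this distribution will be
applied to the time domain.

Survival probability:

$$ P(T > t) = 1 - F(t|\mathcal{E}(\tau)) = e^{-t/\tau} . \tag{61} $$

Expected value and standard deviation:

$$ \mu = \tau \tag{62} $$
$$ \sigma = \tau . \tag{63} $$

The real parameter $\tau$ has the physical meaning of lifetime.

Poisson $\leftrightarrow$ Exponential: If $X$ (= "number of counts during the time $\Delta T$")
is Poisson distributed then $T$ (= "interval of time to be waited - starting
from any instant - before the first count is recorded") is exponentially
distributed:

$$ X \sim f(x|\mathcal{P}_\lambda) \;\Longleftrightarrow\; T \sim f(t|\mathcal{E}(\tau)) \tag{64} $$
$$ \left( \tau = \frac{\Delta T}{\lambda} \right) . \tag{65} $$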
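The link between the two distributions can be illustrated with a small simulation. The sketch below is only an illustration (names, seed and parameter values are arbitrary choices): it generates exponentially distributed waiting times with lifetime `tau`, counts how many events fall in consecutive windows of fixed length, and checks that the counts have mean and variance both close to window/`tau`, as expected for a Poisson variable.

```python
import random
import statistics


def counts_per_window(tau, window, n_windows, seed=12345):
    """Count events in consecutive time windows when waiting times are exponential."""
    rng = random.Random(seed)
    counts = [0] * n_windows
    t_end = window * n_windows
    t = rng.expovariate(1.0 / tau)  # expovariate takes the rate 1/tau
    while t < t_end:
        index = min(int(t / window), n_windows - 1)
        counts[index] += 1
        t += rng.expovariate(1.0 / tau)
    return counts


# With tau = 0.25 and windows of length 1 the expected number of counts is 4.
counts = counts_per_window(tau=0.25, window=1.0, n_windows=20_000)
mean_count = statistics.mean(counts)
var_count = statistics.pvariance(counts)
print(f"mean = {mean_count:.3f}, variance = {var_count:.3f}")
```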

6.3 Distribution of several random variables 

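We only consider the case of two continuous variables ($X$ and $Y$). The
extension to more variables is straightforward. The infinitesimal element of
probability is $dF(x,y) = f(x,y)\,dx\,dy$, and the probability density function

$$ f(x,y) = \frac{\partial^2 F(x,y)}{\partial x\,\partial y} . \tag{66} $$

The probability of finding the variable inside a certain area $A$ is

$$ \iint_A f(x,y)\,dx\,dy . \tag{67} $$

Marginal distributions:

$$ f_X(x) = \int_{-\infty}^{+\infty} f(x,y)\,dy \tag{68} $$
$$ f_Y(y) = \int_{-\infty}^{+\infty} f(x,y)\,dx . \tag{69} $$

The subscripts $X$ and $Y$ indicate that $f_X(x)$ and $f_Y(y)$ are only functions
of $X$ and $Y$, respectively (to avoid fooling around with different symbols
to indicate the generic function).

Conditional distributions:

$$ f_X(x|y) = \frac{f(x,y)}{f_Y(y)} = \frac{f(x,y)}{\int f(x,y)\,dx} \tag{70} $$
$$ f_Y(y|x) = \frac{f(x,y)}{f_X(x)} \tag{71} $$
$$ f(x,y) = f_X(x|y)\,f_Y(y) \tag{72} $$
$$ \phantom{f(x,y)} = f_Y(y|x)\,f_X(x) . \tag{73} $$

Independent random variables:

$$ f(x,y) = f_X(x)\,f_Y(y) \tag{74} $$

(it implies $f(x|y) = f_X(x)$ and $f(y|x) = f_Y(y)$).

Bayes' theorem for continuous random variables:

$$ f(h|e) = \frac{f(e|h)\,f_h(h)}{\int f(e|h)\,f_h(h)\,dh} . \tag{75} $$

Expected value:

$$ \mu_X = E[X] = \iint_{-\infty}^{+\infty} x\,f(x,y)\,dx\,dy \tag{76} $$
$$ \phantom{\mu_X} = \int_{-\infty}^{+\infty} x\,f_X(x)\,dx , \tag{77} $$

and analogously for $Y$. In general

$$ E[g(X,Y)] = \iint_{-\infty}^{+\infty} g(x,y)\,f(x,y)\,dx\,dy . \tag{78} $$

Variance:

$$ \sigma_X^2 = E[X^2] - E^2[X] , \tag{79} $$

and analogously for $Y$.

Covariance:

$$ Cov(X,Y) = E[(X - E[X])(Y - E[Y])] \tag{80} $$
$$ \phantom{Cov(X,Y)} = E[XY] - E[X]\,E[Y] . \tag{81} $$

If $X$ and $Y$ are independent, then $E[XY] = E[X]\,E[Y]$ and hence
$Cov(X,Y) = 0$ (the opposite is true only if $X, Y \sim \mathcal{N}(\cdot)$).

Correlation coefficient:

$$ \rho(X,Y) = \frac{Cov(X,Y)}{\sqrt{Var(X)\,Var(Y)}} \tag{82} $$
$$ \phantom{\rho(X,Y)} = \frac{Cov(X,Y)}{\sigma_X\,\sigma_Y} \tag{83} $$

$(-1 \le \rho \le 1)$.

Linear combinations of random variables:

If $Y = \sum_i c_i X_i$, with $c_i$ real, then:

$$ \mu_Y = E[Y] = \sum_i c_i\,E[X_i] = \sum_i c_i\,\mu_i \tag{84} $$
$$ \sigma_Y^2 = Var(Y) = \sum_i c_i^2\,Var(X_i) + 2 \sum_{i<j} c_i c_j\,Cov(X_i, X_j) \tag{85} $$
$$ \phantom{\sigma_Y^2} = \sum_i c_i^2\,Var(X_i) + \sum_{i \ne j} c_i c_j\,Cov(X_i, X_j) \tag{86} $$
$$ \phantom{\sigma_Y^2} = \sum_i c_i^2 \sigma_i^2 + \sum_{i \ne j} \rho_{ij}\,c_i c_j\,\sigma_i \sigma_j \tag{87} $$
$$ \phantom{\sigma_Y^2} = \sum_{ij} \rho_{ij}\,c_i c_j\,\sigma_i \sigma_j \tag{88} $$
$$ \phantom{\sigma_Y^2} = \sum_{ij} c_i c_j\,\sigma_{ij} . \tag{89} $$

$\sigma_Y^2$ has been written in the different ways, with increasing levels of
compactness, that can be found in the literature. In particular, (89)
uses the convention $\sigma_{ii} = \sigma_i^2$ and the fact that, by definition, $\rho_{ii} = 1$.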
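That the different ways of writing the variance of a linear combination agree can be checked on a small example. The sketch below (illustrative names and numbers, not taken from the notes) evaluates the form (85), with the explicit factor 2 on the pairs $i < j$, and the fully compact double sum, using $\sigma_{ij} = \rho_{ij}\sigma_i\sigma_j$ and $\rho_{ii} = 1$.

```python
def variance_pairs_form(c, sigma, rho):
    """Form (85): diagonal terms plus twice the sum over pairs i < j."""
    n = len(c)
    diagonal = sum(c[i] ** 2 * sigma[i] ** 2 for i in range(n))
    pairs = sum(rho[i][j] * c[i] * c[j] * sigma[i] * sigma[j]
                for i in range(n) for j in range(i + 1, n))
    return diagonal + 2.0 * pairs


def variance_compact_form(c, sigma, rho):
    """Compact form: double sum over all i, j with rho[i][i] = 1."""
    n = len(c)
    return sum(rho[i][j] * c[i] * c[j] * sigma[i] * sigma[j]
               for i in range(n) for j in range(n))


# Example: the difference Y = X1 - X2 of two variables with correlation 0.5.
c = [1.0, -1.0]
sigma = [0.3, 0.4]
rho = [[1.0, 0.5],
       [0.5, 1.0]]

var_pairs = variance_pairs_form(c, sigma, rho)
var_compact = variance_compact_form(c, sigma, rho)
print(var_pairs, var_compact)
```

In this example the positive correlation reduces the variance of the difference from 0.25 (uncorrelated case) to 0.13.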
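Bivariate normal distribution: joint probability density function of $X$
and $Y$ with correlation coefficient $\rho$ (see Fig. 5):

$$ f(x,y) = \frac{1}{2\pi\,\sigma_x \sigma_y \sqrt{1-\rho^2}}
   \exp\left\{ -\frac{1}{2(1-\rho^2)} \left[ \frac{(x-\mu_x)^2}{\sigma_x^2}
   - 2\rho\,\frac{(x-\mu_x)(y-\mu_y)}{\sigma_x \sigma_y}
   + \frac{(y-\mu_y)^2}{\sigma_y^2} \right] \right\} . \tag{90} $$

Marginal distributions:

$$ X \sim \mathcal{N}(\mu_x, \sigma_x) \tag{91} $$
$$ Y \sim \mathcal{N}(\mu_y, \sigma_y) . \tag{92} $$

Conditional distribution:

$$ f(y|x_\circ) = \frac{1}{\sqrt{2\pi}\,\sigma_y \sqrt{1-\rho^2}}
   \exp\left[ -\frac{\left( y - \left[ \mu_y + \rho\,\frac{\sigma_y}{\sigma_x}\,(x_\circ - \mu_x) \right] \right)^2}
   {2\,\sigma_y^2\,(1-\rho^2)} \right] , \tag{93} $$

i.e.

$$ Y|_{x_\circ} \sim \mathcal{N}\left( \mu_y + \rho\,\frac{\sigma_y}{\sigma_x}\,(x_\circ - \mu_x) ,\; \sigma_y \sqrt{1-\rho^2} \right) : \tag{94} $$

the condition $X = x_\circ$ squeezes the standard deviation and shifts the
mean of $Y$.
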



7 Central limit theorem 
7.1 Terms and role 

The well known central limit theorem plays a crucial role in statistics and 
justifies the enormous importance that the normal distribution has in many 
practical applications (this is the reason why it appears on 10 DM notes). 

We have reminded ourselves in (84-85) of the expression of the mean and
variance of a linear combination of random variables

$$ Y = \sum_{i=1}^{n} c_i X_i $$

in the most general case, which includes correlated variables ($\rho_{ij} \ne 0$). In
the case of independent variables the variance is given by the simpler, and
better known, expression

$$ \sigma_Y^2 = \sum_{i=1}^{n} c_i^2 \sigma_i^2 \qquad (\rho_{ij} = 0, \; i \ne j) . \tag{95} $$

Figure 6: Central limit theorem at work: the sum of $n$ variables, for two different
distributions, is shown. The values of $n$ (top-down) are: 1, 2, 3, 5, 10, 20, 50.

This is a very general statement, valid for any number and kind of variables
(with the obvious clause that all $\sigma_i$ must be finite) but it does not give any
information about the probability distribution of $Y$. Even if all $X_i$ follow the
same distribution $f(x)$, $f(y)$ is different from $f(x)$, with some exceptions,
one of these being the normal.

The central limit theorem states that the distribution of a linear
combination $Y$ will be approximately normal if the variables $X_i$ are independent
and $\sigma_Y^2$ is much larger than any single component $c_i^2 \sigma_i^2$ from a non-normally
distributed $X_i$. The last condition is just to guarantee that there is no
single random variable which dominates the fluctuations. The accuracy of the
approximation improves as the number of variables $n$ increases (the theorem
says "when $n \to \infty$"):

$$ n \to \infty \;\Longrightarrow\; Y \sim \mathcal{N}\left( \sum_{i=1}^{n} c_i \mu_i ,\;
   \Big( \sum_{i=1}^{n} c_i^2 \sigma_i^2 \Big)^{1/2} \right) . \tag{96} $$

The proof of the theorem can be found in standard text books. For practical
purposes, and if one is not very interested in the detailed behavior of the
tails, $n$ equal to 2 or 3 may already give a satisfactory approximation,
especially if the $X_i$ exhibit a gaussian-like shape. Look for example at Fig. 6,
where samples of 10000 events have been simulated starting from a uniform
distribution and from a crazy square wave distribution. The latter, depicting
a kind of "worst practical case", shows that, already for $n = 20$ the
distribution of the sum is practically normal. In the case of the uniform distribution
$n = 3$ already gives an acceptable approximation as far as probability
intervals of one or two standard deviations from the mean value are concerned.
The figure also shows that, starting from a triangular distribution (obtained
in the example from the sum of 2 uniformly distributed variables), $n = 2$ is
already sufficient (the sum of 2 triangular distributed variables is equivalent
to the sum of 4 uniformly distributed variables).
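The claim about the uniform distribution can be tested with a short Monte Carlo sketch. The code below is not the simulation used for Fig. 6; the function names, the seed and the sample size are arbitrary choices made here. It sums $n$ uniform variables on (0, 1) and measures the fraction of sums falling within $k$ standard deviations of the mean, to be compared with the normal values of about 68.3 % ($k = 1$) and 95.4 % ($k = 2$).

```python
import random


def coverage(n, k, n_samples=50_000, seed=1):
    """Fraction of sums of n uniform(0,1) variables within k sigma of the mean."""
    rng = random.Random(seed)
    mu = 0.5 * n                 # mean of the sum
    sigma = (n / 12.0) ** 0.5    # standard deviation of the sum, from (56) and (95)
    inside = 0
    for _ in range(n_samples):
        y = sum(rng.random() for _ in range(n))
        if abs(y - mu) < k * sigma:
            inside += 1
    return inside / n_samples


one_sigma = coverage(3, 1.0)
two_sigma = coverage(3, 2.0)
print(f"n = 3: within 1 sigma {100 * one_sigma:.1f} %, within 2 sigma {100 * two_sigma:.1f} %")
```

For comparison, a single uniform variable ($n = 1$) has only about 58 % of its probability within one standard deviation.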

7.2 Distribution of a sample average 

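As a first application of the theorem let us remind ourselves that a sample
average $\overline{X}_n$ of $n$ independent variables

$$ \overline{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i \tag{97} $$

is normally distributed, since it is a linear combination of $n$ variables $X_i$,
with $c_i = 1/n$. Then:

$$ \overline{X}_n \sim \mathcal{N}(\mu_{\overline{X}_n}, \sigma_{\overline{X}_n}) \tag{98} $$
$$ \mu_{\overline{X}_n} = \sum_{i=1}^{n} \frac{1}{n}\,\mu = \mu \tag{99} $$
$$ \sigma_{\overline{X}_n}^2 = \sum_{i=1}^{n} \left( \frac{1}{n} \right)^2 \sigma^2 = \frac{\sigma^2}{n} . $$

This result, we repeat, is independent of the distribution of $X$ and is already
approximately valid for small values of $n$.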

7.3 Normal approximation of the binomial and of the 
Poisson distribution 

Another important application of the theorem is that the binomial and the
Poisson distribution can be approximated, for "large numbers", by a normal
distribution. This is a general result, valid for all distributions which have
the reproductive property under the sum. Distributions of this kind are the
binomial, the Poisson and the $\chi^2$. Let us go into more detail:

$\mathcal{B}_{n,p} \to \mathcal{N}\left(np, \sqrt{np(1-p)}\right)$: The reproductive property of the binomial states
that if $X_1, X_2, \ldots, X_m$ are $m$ independent variables, each following a
binomial distribution of parameter $n_i$ and $p$, then their sum $Y = \sum_i X_i$
also follows a binomial distribution with parameters $n = \sum_i n_i$ and $p$.
It is easy to be convinced of this property without any mathematics:
just think of what happens if one tosses bunches of three, of five and
of ten coins, and then one considers the global result: a binomial with
a large $n$ can then always be seen as a sum of many binomials with
smaller $n_i$. The application of the central limit theorem is
straightforward, apart from deciding when the convergence is acceptable: the
parameters on which one has to judge are in this case $\mu = np$ and the
complementary quantity $\mu^c = n(1-p) = n - \mu$. If they are both > 10
then the approximation starts to be reasonable.

$\mathcal{P}_\lambda \to \mathcal{N}\left(\lambda, \sqrt{\lambda}\right)$: The same argument holds for the Poisson distribution. In
this case the approximation starts to be reasonable when $\mu = \lambda$ > 10.
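The rule of thumb can be probed by comparing the exact binomial cumulative distribution with its normal approximation. In the sketch below the half-unit continuity correction is an assumption added here (it is a common refinement, not something prescribed in these notes), and the function names are illustrative.

```python
import math
from statistics import NormalDist


def binomial_cdf(k, n, p):
    """Exact P(X <= k) for a binomial with parameters n and p."""
    return sum(math.comb(n, x) * p ** x * (1.0 - p) ** (n - x) for x in range(k + 1))


def normal_cdf_approx(k, n, p):
    """Normal approximation with mu = np, sigma = sqrt(np(1-p))."""
    mu = n * p
    sigma = math.sqrt(n * p * (1.0 - p))
    return NormalDist(mu, sigma).cdf(k + 0.5)  # continuity correction (assumption)


def worst_cdf_error(n, p):
    """Largest absolute difference between exact and approximate CDF."""
    return max(abs(binomial_cdf(k, n, p) - normal_cdf_approx(k, n, p))
               for k in range(n + 1))


# np = n(1-p) = 10: both parameters at the edge of the rule of thumb.
good_case = worst_cdf_error(20, 0.5)
# np = 1: far from satisfying it.
poor_case = worst_cdf_error(10, 0.1)
print(f"n = 20, p = 0.5: {good_case:.4f}   n = 10, p = 0.1: {poor_case:.4f}")
```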
7.4 Normal distribution of measurement errors

The central limit theorem is also important to justify why in many cases
the distribution followed by the measured values around their average is
approximately normal. Often, in fact, the random experimental error $e$,
which causes the fluctuations of the measured values around the unknown
true value of the physical quantity, can be seen as an incoherent sum of
smaller contributions

$$ e = \sum_k e_k , \tag{102} $$

each contribution having a distribution which satisfies the conditions of the
central limit theorem.

7.5 Caution 

After this commercial in favour of the miraculous properties of the central 
limit theorem, two remarks of caution: 

• sometimes the conditions of the theorem are not satisfied: 

— a single component dominates the fluctuation of the sum: a typical 
case is the well known Landau distribution; also systematic errors 
may have the same effect on the global error; 

— the condition of independence is lost if systematic errors affect a 
set of measurements, or if there is coherent noise; 

• the tails of the distributions do exist and they are not always gaussian! 
Moreover, realizations of a random variable several standard deviations 
away from the mean are possible. And they show up without notice! 

8 Measurement errors and measurement uncertainty

One might assume that the concepts of error and uncertainty are well enough
known to be not worth discussing. Nevertheless a few comments are needed
(although for more details the DIN [1] and ISO [2, 15] recommendations
should be consulted):

• the first concerns the terminology. In fact the words error and
uncertainty are currently used almost as synonyms:

- "error" to mean both error and uncertainty (but nobody says
"Heisenberg Error Principle");

- "uncertainty" only for the uncertainty.

"Usually" we understand what each other is talking about, but a more
precise use of these nouns would really help. This is strongly called
for by the DIN [1] and ISO [2, 15] recommendations. They state in fact
that

- error is "the result of a measurement minus a true value of the
measurand": it follows that the error is usually unknown;

- uncertainty is a "parameter, associated with the result of a
measurement, that characterizes the dispersion of the values that could
reasonably be attributed to the measurand";

• Within the High Energy Physics community there is an established
practice for reporting the final uncertainty of a measurement in the
form of standard deviation. This is also recommended by these norms.
However this should be done at each step of the analysis, instead of
estimating "maximum error bounds" and using them as standard
deviation in the "error propagation";

• the process of measurement is a complex one and it is difficult to
disentangle the different contributions which cause the total error. In
particular, the active role of the experimentalist is sometimes
overlooked. For this reason it is often incorrect to quote the ("nominal")
uncertainty due to the instrument as if it were the uncertainty of the
measurement.

9 Statistical Inference 
9.1 Bayesian inference 

In the Bayesian framework the inference is performed by calculating the final
distribution of the random variable associated with the true values of the
physical quantities from all available information. Let us call
$x = \{x_1, x_2, \ldots, x_n\}$ the n-tuple ("vector") of observables,
$\mu = \{\mu_1, \mu_2, \ldots, \mu_n\}$ the n-tuple of the true values of the
physical quantities of interest, and $h = \{h_1, h_2, \ldots, h_n\}$ the n-tuple
of all the possible realizations of the influence variables $H_i$. The
term "influence variable" is used here with an extended meaning, to indicate
not only external factors which could influence the result (temperature,
atmospheric pressure, and so on) but also any possible calibration constant
and any source of systematic errors. In fact the distinction between $\mu$ and
$h$ is artificial, since they are all conditional hypotheses. We separate them
simply because at the end we will "marginalize" the final joint distribution
function with respect to $\mu$, integrating the joint distribution with respect to
the other hypotheses considered as influence variables.

The likelihood of the sample $x$ being produced from $h$ and $\mu$ and the
initial probability are

$$f(x \mid \mu, h, H_0)$$

and

$$f_0(\mu, h) = f(\mu, h \mid H_0) , \qquad (103)$$

respectively. $H_0$ is intended to remind us, yet again, that likelihoods and
priors - and hence conclusions - depend on all explicit and implicit assumptions
within the problem, and in particular on the parametric functions used to
model priors and likelihoods. To simplify the formulae, $H_0$ will no longer be
written explicitly.

Using the Bayes formula for multidimensional continuous distributions
(an extension of (75)) we obtain the most general formula of inference

$$f(\mu, h \mid x) = \frac{f(x \mid \mu, h)\, f_0(\mu, h)}{\int f(x \mid \mu, h)\, f_0(\mu, h)\, d\mu\, dh} , \qquad (104)$$

yielding the joint distribution of all conditional variables $\mu$ and $h$ which are
responsible for the observed sample $x$. To obtain the final distribution of $\mu$
one has to integrate (104) over all possible values of $h$, obtaining



$$f(\mu \mid x) = \int f(\mu, h \mid x)\, dh = \frac{\int f(x \mid \mu, h)\, f_0(\mu, h)\, dh}{\int f(x \mid \mu, h)\, f_0(\mu, h)\, d\mu\, dh} . \qquad (105)$$



Apart from the technical problem of evaluating the integrals, if need be
numerically or using Monte Carlo methods*, (105) represents the most general
form of hypothetical inductive inference. The word "hypothetical" reminds
us of $H_0$.

* This is conceptually what experimentalists do when they change all the parameters of
the Monte Carlo simulation in order to estimate the "systematic error".

When all the sources of influence are under control, i.e. they can be
assumed to take a precise value, the initial distribution can be factorized into
a $f_0(\mu)$ and a Dirac $\delta(h - h_0)$, obtaining the much simpler formula

$$f(\mu \mid x) = \frac{\int f(x \mid \mu, h)\, f_0(\mu)\, \delta(h - h_0)\, dh}{\int f(x \mid \mu, h)\, f_0(\mu)\, \delta(h - h_0)\, d\mu\, dh} = \frac{f(x \mid \mu, h_0)\, f_0(\mu)}{\int f(x \mid \mu, h_0)\, f_0(\mu)\, d\mu} . \qquad (106)$$

Even if formulae (105-106) look complicated because of the multidimensional
integration and of the continuous nature of $\mu$, conceptually they are identical
to the example of the dE/dx measurement discussed in Sec. 9.1.

The final probability density function provides the most complete and
detailed information about the unknown quantities, but sometimes (almost
always ...) one is not interested in full knowledge of $f(\mu)$, but just in a
few numbers which summarize at best the position and the width of the
distribution (for example when publishing the result in a journal in the most
compact way). The most natural quantities for this purpose are the expected
value and the variance, or the standard deviation. Then the Bayesian best
estimate of a physical quantity is:

$$\hat{\mu}_i = E[\mu_i] = \int \mu_i\, f(\mu \mid x)\, d\mu , \qquad (107)$$
$$\sigma^2_{\mu_i} = Var(\mu_i) = E[\mu_i^2] - E^2[\mu_i] , \qquad (108)$$
$$\sigma_{\mu_i} = +\sqrt{\sigma^2_{\mu_i}} . \qquad (109)$$

When many true values are inferred from the same data the numbers
which synthesize the result are not only the expected values and variances,
but also the covariances, which give at least the (linear!) correlation
coefficients between the variables:

$$\rho_{ij} = \rho(\mu_i, \mu_j) = \frac{Cov(\mu_i, \mu_j)}{\sigma_{\mu_i}\, \sigma_{\mu_j}} . \qquad (110)$$

In the following sections we will deal in most cases with only one value to
infer:

$$f(\mu \mid x) = \ldots \qquad (111)$$

9.2 Bayesian inference and maximum likelihood 

We have already said that the dependence of the final probabilities on the
initial ones gets weaker as the amount of experimental information increases.
Without going into mathematical complications (the proof of this statement
can be found for example in [3]) this simply means that, asymptotically,
whatever $f_0(\mu)$ one puts in (106), $f(\mu \mid x)$ is unaffected. This is "equivalent" to
dropping $f_0(\mu)$ from (106). This results in

$$f(\mu \mid x) \approx \frac{f(x \mid \mu, h_0)}{\int f(x \mid \mu, h_0)\, d\mu} . \qquad (112)$$

Since the denominator of the Bayes formula has the technical role of properly
normalizing the probability density function, the result can be written in the
simple form

$$f(\mu \mid x) \propto f(x \mid \mu, h_0) = \mathcal{L}(x; \mu, h_0) . \qquad (113)$$

Asymptotically the final probability is just the (normalized) likelihood! The
notation $\mathcal{L}$ is that used in the maximum likelihood literature (note that, not
only does $f$ become $\mathcal{L}$, but also "|" has been replaced by ";": $\mathcal{L}$ has no
probabilistic interpretation in conventional statistics).

If the mean value of $f(\mu \mid x)$ coincides with the value for which $f(\mu \mid x)$ has
a maximum, we obtain the maximum likelihood method. This does not mean
that the Bayesian methods are "blessed" because of this achievement, and
that they can be used only in those cases where they provide the same results.
It is the other way round: the maximum likelihood method gets justified when
all the limiting conditions of the approach (insensitivity of the result
to the initial probability, which requires a large number of events) are satisfied.

Even if in this asymptotic limit the two approaches yield the same nu- 
merical results, there are differences in their interpretation: 

• the likelihood, after proper normalization, has a probabilistic meaning
for Bayesians but not for frequentists; so Bayesians can say that the
probability that $\mu$ is in a certain interval is, for example, 68%, while
this statement is blasphemous for a frequentist ("the true value is a
constant" from his point of view);

• frequentists prefer to choose $\hat{\mu}_L$, the value which maximizes the
likelihood, as estimator. For Bayesians, on the other hand, the expected
value $\hat{\mu}_B = E[\mu]$ (also called the prevision) is more appropriate. This
is justified by the fact that the assumption of $E[\mu]$ as best estimate
of $\mu$ minimizes the risk of a bet (always keep the bet in mind!). For
example, if the final distribution is exponential with parameter $\tau$ (let us
think for a moment of particle decays) the maximum likelihood method
would recommend betting on the value $t = 0$, whereas the Bayesian
approach suggests the value $t = \tau$. If the terms of the bet are "whoever
gets closest wins" what is the best strategy? And then, what is the
best strategy if the terms are "whoever gets the exact value wins"?
But now think of the probability of getting the exact value and of the
probability of getting closest.
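The bet on the exponential distribution can be tried out numerically. The following is only a sketch under stated assumptions: the two bettors, the function names and the reading of "risk" as mean squared distance from the outcome are illustrative choices, not part of the argument above.

```python
import math
import random

def closest_wins_probability(bet_a, bet_b, tau=1.0, trials=100_000, seed=1):
    """Fraction of exponential outcomes (mean tau) that fall closer to bet_a than to bet_b."""
    rng = random.Random(seed)  # fixed seed, only to make the sketch reproducible
    wins_a = 0
    for _ in range(trials):
        t = rng.expovariate(1.0 / tau)
        if abs(t - bet_a) < abs(t - bet_b):
            wins_a += 1
    return wins_a / trials

def mean_squared_loss(bet, tau=1.0):
    """Exact E[(t - bet)^2] for an exponential with mean tau: variance plus squared offset."""
    return tau ** 2 + (tau - bet) ** 2

tau = 1.0
# Bettor A plays the most probable value t = 0, bettor B plays the expected value t = tau.
print("P(A is closest), simulated:", closest_wins_probability(0.0, tau, tau))
print("P(A is closest), exact    :", 1.0 - math.exp(-0.5))
print("mean squared loss at t = 0  :", mean_squared_loss(0.0, tau))
print("mean squared loss at t = tau:", mean_squared_loss(tau, tau))
```

Under these assumptions the bettor on $t = 0$ wins only when the outcome falls below $\tau/2$, which happens with probability $1 - e^{-1/2}$, i.e. in about 39% of the cases.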

9.3 The dog, the hunter and the biased Bayesian estimators

One of the most important tests to judge how good an estimator is, is whether 
or not it is correct (not biased). Maximum likelihood estimators are usually 
correct, while Bayesian estimators - analysed within the maximum likelihood 
framework - often are not. This could be considered a weak point - however 
the Bayes estimators are simply naturally consistent with the status of in- 
formation before new data become available. In the maximum likelihood 
method, on the other hand, it is not clear what the assumptions are. 

Let us take an example which shows the logic of frequentistic inference 
and why the use of reasonable prior distributions yields results which that 
frame classifies as distorted. Imagine meeting a hunting dog in the country. 
Let us assume we know that there is a 50 % probability of finding the dog 
within a radius of 100 meters centered on the position of the hunter (this is 
our likelihood). Where is the hunter? He is with 50% probability within a 
radius of 100 meters around the position of the dog, with equal probability 
in all directions. "Obvious". This is exactly the logic scheme used in the 
frequentistic approach to build confidence regions from the estimator (the 
dog in this example). This however assumes that the hunter can be anywhere 
in the country. But now let us change the status of information: "the dog is 
by a river" ; "the dog has collected a duck and runs in a certain direction" ; 
"the dog is sleeping"; "the dog is in a field surrounded by a fence through 
which he can pass without problems, but the hunter cannot". Given any 
new condition the conclusion changes. Some of the new conditions change
our likelihood, but some others only influence the initial distribution. For
example, the case of the dog in an enclosure inaccessible to the hunter is 
exactly the problem encountered when measuring a quantity close to the 
edge of its physical region, which is quite common in frontier research. 
10 Choice of the initial probability density function



10.1 Difference with respect to the discrete case 

The title of this section is similar to that of Sec. 5, but the problem and the
conclusions will be different. There we said that the Indifference Principle
(or, in its refined modern version, the Maximum Entropy Principle) was
a good choice. Here there are problems with infinities and with the fact
that it is possible to map an infinite number of points contained in a finite
region onto an infinite number of points contained in a larger or smaller
finite region. This changes the probability density function. If, moreover,
the transformation from one set of variables to the other is not linear (see,
e.g. Fig. 7) what is uniform in one variable ($X$) is not uniform in another
variable (e.g. $Y = X^2$). This problem does not exist in the case of discrete
variables, since if $X = x_i$ has a probability $f(x_i)$ then $Y = x_i^2$ has the same
probability. A different way of stating the problem is that the Jacobian of
the transformation squeezes or stretches the metric, changing the probability
density function.
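A quick numerical illustration of this point, as a sketch only: the interval (0, 1), the threshold 0.25 and the helper name `fraction_below` are arbitrary choices made for the example.

```python
import random

def fraction_below(values, threshold):
    """Fraction of the sample lying below the threshold."""
    return sum(v < threshold for v in values) / len(values)

rng = random.Random(0)                        # fixed seed, for reproducibility only
xs = [rng.random() for _ in range(100_000)]   # X uniform on (0, 1)
ys = [x * x for x in xs]                      # Y = X^2: the same points on a stretched metric

# A variable uniform on (0, 1) would put 25% of the points below 0.25.
print("P(X < 0.25) ~", fraction_below(xs, 0.25))   # close to 0.25
print("P(Y < 0.25) ~", fraction_below(ys, 0.25))   # close to 0.50, i.e. P(X < 0.5)
```

The points are the same in both lists; only the variable used to describe them has changed, and with it the density.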

We will not enter into the open discussion about the optimal choice of the 
distribution. Essentially we shall use the uniform distribution, being careful 
to employ the variable which "seems" most appropriate for the problem, but 
You may disagree - surely with good reason - if You have a different kind of 
experiment in mind. 

The same problem is also present, but well hidden, in the maximum
likelihood method. For example, it is possible to demonstrate that, in the
case of normally distributed likelihoods, a uniform distribution of the mean
$\mu$ is implicitly assumed (see section 11). There is nothing wrong with this,
but one should be aware of it.

10.2 Bertrand paradox and angels' sex 

A good example to help understand the problems outlined in the previous 
section is the so-called Bertrand paradox: 

Problem: Given a circle of radius $R$ and a chord drawn randomly on it,
what is the probability that the length $L$ of the chord is smaller than
$R$?

Solution 1: Choose "randomly" two points on the circumference and draw
a chord between them: $\Rightarrow P(L < R) = 1/3 = 0.33$.

Solution 2: Choose a straight line passing through the center of the circle;
then draw a second line, orthogonal to the first, and which intersects
it inside the circle at a "random" distance from the center:
$\Rightarrow P(L < R) = 1 - \sqrt{3}/2 = 0.13$.

Solution 3: Choose "randomly" a point inside the circle and draw a straight
line orthogonal to the radius that passes through the chosen point:
$\Rightarrow P(L < R) = 1/4 = 0.25$.

Your solution: ?

Question: What is the origin of the paradox?

Answer: The problem does not specify how to "randomly" choose the chord.
The three solutions take a uniform distribution: along the circumference;
along the radius; inside the area. What is uniform in one
variable is not uniform in the others!

Question: Which is the right solution?

In principle you may imagine an infinite number of different solutions. From 
a physicist's viewpoint any attempt to answer this question is a waste of time. 
The reason why the paradox has been compared to the Byzantine discussions 
about the sex of angels is that there are indeed people arguing about it. For 
example, there is a school of thought which insists that Solution 2 is the right 
one. 
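The three solutions of the Bertrand problem can be checked with a small Monte Carlo. This is a sketch, not a statement on which recipe is "right": the function names, the unit radius and the number of trials are choices made here for illustration.

```python
import math
import random

def chord_two_points(rng, radius=1.0):
    """Solution 1: chord between two points chosen uniformly on the circumference."""
    a = rng.uniform(0.0, 2.0 * math.pi)
    b = rng.uniform(0.0, 2.0 * math.pi)
    return 2.0 * radius * abs(math.sin((a - b) / 2.0))

def chord_random_distance(rng, radius=1.0):
    """Solution 2: chord at a distance from the center chosen uniformly along a radius."""
    d = rng.uniform(0.0, radius)
    return 2.0 * math.sqrt(max(0.0, radius * radius - d * d))

def chord_random_midpoint(rng, radius=1.0):
    """Solution 3: chord whose midpoint is chosen uniformly inside the area of the circle."""
    d = radius * math.sqrt(rng.random())   # radial distance of a point uniform in area
    return 2.0 * math.sqrt(max(0.0, radius * radius - d * d))

def prob_shorter_than_radius(sampler, trials=100_000, seed=0, radius=1.0):
    """Monte Carlo estimate of P(L < R) for a given way of drawing the chord."""
    rng = random.Random(seed)
    return sum(sampler(rng, radius) < radius for _ in range(trials)) / trials

for name, sampler in [("circumference", chord_two_points),
                      ("radius", chord_random_distance),
                      ("area", chord_random_midpoint)]:
    print(f"uniform along/inside the {name}: P(L < R) ~ {prob_shorter_than_radius(sampler):.3f}")
```

Each sampler is uniform in a different variable, and the three estimates settle near 1/3, $1 - \sqrt{3}/2$ and 1/4 respectively.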

In fact this kind of paradox, together with abuse of the Indifference Prin- 
ciple for problems like "what is the probability that the sun will rise tomorrow 
morning" threw a shadow over Bayesian methods at the end of last century. 
The maximum likelihood method, which does not make explicit use of prior 
distributions, was then seen as a valid solution to the problem. But in re- 
ality the ambiguity of the proper metrics on which the initial distribution 
is uniform has an equivalent on the arbitrariness of the variable used in the 
likelihood function. In the end, what was criticized when it was stated ex- 
plicitly in the Bayes formula is accepted passively when it is hidden in the 
maximum likelihood method. 
11 Normally distributed observables

11.1 Final distribution, prevision and credibility intervals of the true value

The first application of Bayesian inference will be that of a normally
distributed quantity. Let us take a data sample $q$ of $n_1$ measurements, of
which we calculate the average $\bar{q}_{n_1}$. In our formalism $\bar{q}_{n_1}$ is a realization of
the random variable $\bar{Q}_{n_1}$. Let us assume we know the standard deviation
$\sigma$ of the variable, either because $n_1$ is very large and it can be estimated
accurately from the sample or because it was known a priori (we are not going
to discuss in these notes the case of small samples and unknown variance).
The property of the average (see 7.2) tells us that the likelihood $f(\bar{Q}_{n_1} \mid \mu, \sigma)$
is gaussian:

$$\bar{Q}_{n_1} \sim \mathcal{N}(\mu, \sigma/\sqrt{n_1}) . \qquad (114)$$

To simplify the following notation, let us call $x_1$ this average and $\sigma_1$ the
standard deviation of the average:

$$x_1 = \bar{q}_{n_1} , \qquad (115)$$
$$\sigma_1 = \sigma/\sqrt{n_1} . \qquad (116)$$

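We then apply (106) and get

$$f(\mu \mid x_1, \mathcal{N}(\cdot, \sigma_1)) = \frac{\frac{1}{\sqrt{2\pi}\,\sigma_1}\, e^{-\frac{(x_1 - \mu)^2}{2\sigma_1^2}}\, f_0(\mu)}{\int \frac{1}{\sqrt{2\pi}\,\sigma_1}\, e^{-\frac{(x_1 - \mu)^2}{2\sigma_1^2}}\, f_0(\mu)\, d\mu} . \qquad (117)$$

At this point we have to make a choice for $f_0(\mu)$. A reasonable choice is to
take, as a first guess, a uniform distribution defined over a "large" interval
which includes $x_1$. It is not really important how large the interval is, for a
few $\sigma_1$'s away from $x_1$ the integrand in the denominator tends to zero because
of the gaussian function. What is important is that a constant $f_0(\mu)$ can be
simplified in (117), obtaining

$$f(\mu \mid x_1, \mathcal{N}(\cdot, \sigma_1)) = \frac{\frac{1}{\sqrt{2\pi}\,\sigma_1}\, e^{-\frac{(x_1 - \mu)^2}{2\sigma_1^2}}}{\int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma_1}\, e^{-\frac{(x_1 - \mu)^2}{2\sigma_1^2}}\, d\mu} . \qquad (118)$$

The integral in the denominator is equal to unity, since integrating with
respect to $\mu$ is equivalent to integrating with respect to $x_1$. The final result
is then

$$f(\mu) = f(\mu \mid x_1, \mathcal{N}(\cdot, \sigma_1)) = \frac{1}{\sqrt{2\pi}\,\sigma_1}\, e^{-\frac{(\mu - x_1)^2}{2\sigma_1^2}} : \qquad (119)$$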

• the true value is normally distributed around $x_1$;

• its best estimate (prevision) is $E[\mu] = x_1$;

• its variance is $\sigma_\mu^2 = \sigma_1^2$;

• the "confidence intervals", or credibility intervals, in which there is a
certain probability of finding the true value are easily calculable:



Probability level (%)      credibility interval
(confidence level)         (confidence interval)

      68.3                 x_1 ± σ_1
      90.0                 x_1 ± 1.65 σ_1
      95.0                 x_1 ± 1.96 σ_1
      99.0                 x_1 ± 2.58 σ_1
      99.73                x_1 ± 3 σ_1
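The probability levels of the table follow from the gaussian form of (119). As a minimal sketch (the function name `gaussian_credibility` is chosen here for illustration), the probability that $\mu$ lies within $k$ standard deviations of $x_1$ can be evaluated with the error function:

```python
import math

def gaussian_credibility(k):
    """Probability that a normally distributed quantity lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2.0))

for k in (1.0, 1.65, 1.96, 2.58, 3.0):
    print(f"x_1 +/- {k:4.2f} sigma_1 : {100.0 * gaussian_credibility(k):6.2f} %")
```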



11.2 Combination of several measurements 

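Let us imagine making a second set of measurements of the physical quantity,
which we assume unchanged from the previous set of measurements. How will
our knowledge of $\mu$ change after this new information? Let us call $x_2 = \bar{q}_{n_2}$
and $\sigma_2 = \sigma'/\sqrt{n_2}$ the new average and standard deviation of the average ($\sigma'$
may be different from $\sigma$ of the sample of numerosity $n_1$). Applying Bayes'
theorem a second time we now have to use as initial distribution the final
probability of the previous inference:

$$f(\mu \mid x_1, \sigma_1, x_2, \sigma_2, \mathcal{N}) = \frac{\frac{1}{\sqrt{2\pi}\,\sigma_2}\, e^{-\frac{(x_2 - \mu)^2}{2\sigma_2^2}}\, f(\mu \mid x_1, \mathcal{N}(\cdot, \sigma_1))}{\int \frac{1}{\sqrt{2\pi}\,\sigma_2}\, e^{-\frac{(x_2 - \mu)^2}{2\sigma_2^2}}\, f(\mu \mid x_1, \mathcal{N}(\cdot, \sigma_1))\, d\mu} . \qquad (120)$$

The integral is not as simple as the previous one, but still feasible analytically.
The final result is

$$f(\mu \mid x_1, \sigma_1, x_2, \sigma_2, \mathcal{N}) = \frac{1}{\sqrt{2\pi}\,\sigma_A}\, e^{-\frac{(\mu - x_A)^2}{2\sigma_A^2}} , \qquad (121)$$

where

$$x_A = \frac{x_1/\sigma_1^2 + x_2/\sigma_2^2}{1/\sigma_1^2 + 1/\sigma_2^2} , \qquad (122)$$

$$\frac{1}{\sigma_A^2} = \frac{1}{\sigma_1^2} + \frac{1}{\sigma_2^2} . \qquad (123)$$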

One recognizes the famous formula of the weighted average with the inverse
of the variances, usually obtained from maximum likelihood. Some remarks:

• Bayes' theorem updates the knowledge about $\mu$ in an automatic and
natural way;

• if $\sigma_1 \gg \sigma_2$ (and $x_1$ is not "too far" from $x_2$) the final result is only
determined by the second sample of measurements. This suggests that
an alternative vague a priori distribution can be, instead of the uniform,
a gaussian with a large enough variance and a reasonable mean;

• the combination of the samples requires a subjective judgement that
the two samples are really coming from the same true value $\mu$. We
will not discuss this point in these notes, but a hint on how to proceed
is: take the inference on the difference of two measurements, $D$, as
explained at the end of Section 13.1, and judge for yourself whether $D = 0$ is
consistent with the probability density function of $D$.
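The combination rule (weighted average with the inverse of the variances) and the second remark above can be sketched in a few lines. The function name `combine` and the numerical values are illustrative assumptions only.

```python
import math

def combine(x1, sigma1, x2, sigma2):
    """Weighted average of two gaussian results, with weights equal to the inverse variances."""
    w1 = 1.0 / sigma1 ** 2
    w2 = 1.0 / sigma2 ** 2
    mean = (w1 * x1 + w2 * x2) / (w1 + w2)
    sigma = 1.0 / math.sqrt(w1 + w2)
    return mean, sigma

# Two results of equal quality: the average sits in the middle, the uncertainty shrinks by sqrt(2).
print(combine(10.0, 1.0, 12.0, 1.0))

# A very vague first result (sigma1 >> sigma2): the second measurement dominates.
print(combine(10.0, 100.0, 12.0, 1.0))
```

The second call mimics a vague gaussian prior with a large variance: the outcome is practically the second measurement alone.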

11.3 Measurements close to the edge of the physical region

A case which has essentially no solution in the maximum likelihood approach
is when a measurement is performed at the edge of the physical region and
the measured value comes out very close to it, or even in the unphysical
region. Let us take a numerical example:

Problem: An experiment is planned to measure the (electron) neutrino
mass. The simulations show that the mass resolution is 3.3 eV/c²,
largely independent of the mass value, and that the measured mass is
normally distributed around the true mass*. The mass value which
results from the elaboration**, corrected for all known systematic
effects, is $x = -5.41$ eV/c². What have we learned about the neutrino
mass?

* In reality, often what is normally distributed is $m^2$ instead of $m$. Under this hypothesis
the terms of the problem change and a new solution should be worked out, following
the path indicated in this example.

** We consider detector and analysis machinery as a black box, no matter how complicated
it was, and treat the numerical outcome as a result of a direct measurement [1].



Solution: Our a priori knowledge of the mass is that it is positive and not too
large (otherwise it would already have been measured in other experiments).
One can take any vague distribution which assigns a probability
density function between 0 and 20 or 30 eV/c². In fact, if an
experiment having a resolution of $\sigma = 3.3$ eV/c² has been planned and
financed by rational people, with the hope of finding evidence of non
negligible mass, it means that the mass was thought to be in that range.
If there is no reason to prefer one of the values in that interval a uniform
distribution can be used, for example



$$f_{0K}(m) = k = 1/30 \qquad (0 \le m \le 30) . \qquad (124)$$



Otherwise, if one thinks there is a greater chance of the mass having
small rather than high values, a prior which reflects such an assumption
could be chosen, for example a half normal with $\sigma_0 = 10$ eV

$$f_{0N}(m) = \frac{2}{\sqrt{2\pi}\,\sigma_0}\, \exp\left[-\frac{m^2}{2\sigma_0^2}\right] \qquad (m > 0) , \qquad (125)$$

or a triangular distribution

$$f_{0T}(m) = \frac{1}{450}\, (30 - m) \qquad (0 \le m \le 30) . \qquad (126)$$



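Let us consider for simplicity the uniform distribution:

$$f(m \mid x, f_{0K}) = \frac{\frac{1}{\sqrt{2\pi}\,\sigma}\, \exp\left[-\frac{(m - x)^2}{2\sigma^2}\right] k}{\int_0^{30} \frac{1}{\sqrt{2\pi}\,\sigma}\, \exp\left[-\frac{(m - x)^2}{2\sigma^2}\right] k\, dm} \qquad (127)$$

$$= \frac{19.8}{\sqrt{2\pi}\,\sigma}\, \exp\left[-\frac{(m - x)^2}{2\sigma^2}\right] \qquad (0 \le m \le 30) . \qquad (128)$$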



The value which has the highest degree of belief is $m = 0$, but $f(m)$
is non vanishing up to 30 eV/c² (even if very small). We can define an
interval, starting from $m = 0$, in which we believe that $m$ should lie with
a certain probability. For example this level of probability can be 95 %.
One has to find the value $m_0$ for which the cumulative function $F(m_0)$
equals 0.95. This value of $m$ is called the upper limit (or upper bound).
The result is

$$m < 3.9\ \mathrm{eV}/c^2 \ \text{at 95 \% C.L.} \qquad (129)$$
If we had assumed the other initial distributions the limit would have
been in both cases

$$m < 3.7\ \mathrm{eV}/c^2 \ \text{at 95 \% C.L.} , \qquad (130)$$

practically the same (especially if compared with the experimental
resolution of 3.3 eV/c²).
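The upper limits quoted in this solution can be reproduced approximately by integrating the final distribution numerically. What follows is a sketch under stated assumptions: the midpoint-rule integration, the number of steps and all function names are choices made here, and the half normal prior is cut at 30 eV/c² for convenience (the likelihood makes the region beyond negligible).

```python
import math

SIGMA = 3.3      # mass resolution in eV/c^2, from the problem
X_OBS = -5.41    # measured value in eV/c^2, from the problem
M_MAX = 30.0     # upper edge of the range considered, in eV/c^2

def likelihood(m):
    """Gaussian likelihood of observing X_OBS for a true mass m (constant factor dropped)."""
    return math.exp(-((m - X_OBS) ** 2) / (2.0 * SIGMA ** 2))

def uniform_prior(m):
    return 1.0 / M_MAX

def half_normal_prior(m, sigma_0=10.0):
    return 2.0 / (math.sqrt(2.0 * math.pi) * sigma_0) * math.exp(-m * m / (2.0 * sigma_0 ** 2))

def triangular_prior(m):
    return (M_MAX - m) / 450.0

def upper_limit(prior, prob=0.95, steps=30_000):
    """Mass below which a fraction `prob` of the final distribution lies (midpoint rule on [0, M_MAX])."""
    dm = M_MAX / steps
    weights = [likelihood((i + 0.5) * dm) * prior((i + 0.5) * dm) for i in range(steps)]
    target = prob * sum(weights)      # the normalization cancels in the ratio
    running = 0.0
    for i, w in enumerate(weights):
        running += w
        if running >= target:
            return (i + 1) * dm
    return M_MAX

for name, prior in [("uniform", uniform_prior),
                    ("half normal", half_normal_prior),
                    ("triangular", triangular_prior)]:
    print(f"{name:12s} prior: m < {upper_limit(prior):.2f} eV/c^2 at 95 %")
```

With these inputs the three limits should all come out between about 3.7 and 3.9 eV/c², i.e. of the order of the experimental resolution whichever vague prior is used.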

Comment: Let us assume an a priori function sharply peaked at zero and
see what happens. For example it could be of the kind

$$f_{0S}(m) \propto \frac{1}{m} . \qquad (131)$$

To avoid singularities in the integral, let us take a power of $m$ a bit
greater than $-1$, for example $-0.99$, and let us limit its domain to 30,
getting

$$f_{0S}(m) = \frac{0.01 \cdot 30^{-0.01}}{m^{0.99}} . \qquad (132)$$

The upper limit becomes

$$m < 0.006\ \mathrm{eV}/c^2 \ \text{at 95 \% C.L.} \qquad (133)$$

Any experienced physicist would find this result ridiculous. The upper
limit is less than 0.2 % of the experimental resolution; like expecting to
resolve objects having dimensions smaller than a micron with a design
ruler! Notice instead that in the previous examples the limit was always
of the order of magnitude of the experimental resolution $\sigma$. As $f_{0S}(m)$
becomes more and more peaked at zero (power of $m \to -1$) the limit
gets smaller and smaller. This means that, asymptotically, the degree
of belief that $m = 0$ is so high that whatever you measure you will
conclude that $m = 0$: you could use the measurement to calibrate
the apparatus! This means that this choice of initial distribution was
unreasonable.



12 Counting experiments 

12.1 Binomially distributed quantities 

Let us assume we have performed $n$ trials and obtained $x$ favorable events.
What is the probability of the next event? This situation happens frequently
when measuring efficiencies, branching ratios, etc. Stated more generally, one
tries to infer the "constant and unknown probability"* of an event occurring.

Where we can assume that the probability is constant and the observed
number of favorable events is binomially distributed, the unknown quantity
to be measured is the parameter $p$ of the binomial. Using Bayes' theorem we
get



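$$f(p \mid x, n, \mathcal{B}) = \frac{f(x \mid \mathcal{B}_{n,p})\, f_0(p)}{\int_0^1 f(x \mid \mathcal{B}_{n,p})\, f_0(p)\, dp} = \frac{\frac{n!}{(n-x)!\, x!}\, p^x (1-p)^{n-x}\, f_0(p)}{\int_0^1 \frac{n!}{(n-x)!\, x!}\, p^x (1-p)^{n-x}\, f_0(p)\, dp} = \frac{p^x (1-p)^{n-x}}{\int_0^1 p^x (1-p)^{n-x}\, dp} , \qquad (134)$$

where an initial uniform distribution has been assumed. The final distribution
is known to statisticians as the $\beta$ distribution since the integral in the
denominator is the special function called $\beta$, defined also for real values of $x$
and $n$ (technically this is a $\beta$ with parameters $a = x + 1$ and $b = n - x + 1$).
In our case these two numbers are integer and the integral becomes equal to
$x!\,(n-x)!/(n+1)!$. We then get

$$f(p \mid x, n, \mathcal{B}) = \frac{(n+1)!}{x!\,(n-x)!}\, p^x (1-p)^{n-x} . \qquad (135)$$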



* This concept, which is very close to the physicist's mentality, is not correct from the
probabilistic - cognitive - point of view. According to the Bayesian scheme, in fact, the
probability changes with the new observations. The final inference of $p$, however, does not
depend on the particular sequence yielding $x$ successes over $n$ trials. This can be seen in
the next table where $f_n(p)$ is given as a function of the number of trials $n$, for the three
sequences which give 2 successes (indicated by "1") in three trials (the use of (135) is
anticipated):

                      Sequence
  n     011             101             110
  0     1               1               1
  1     2(1-p)          2p              2p
  2     6p(1-p)         6p(1-p)         3p^2
  3     12p^2(1-p)      12p^2(1-p)      12p^2(1-p)

This important result, related to the concept of interchangeability, "allows" a physicist who
is reluctant to give up the concept "unknown constant probability" to see the problem
from his point of view, ensuring that the same numerical result is obtained.
The expected value and the variance of this distribution are:

$$E[p] = \frac{x+1}{n+2} , \qquad (136)$$

$$Var(p) = \frac{(x+1)(n-x+1)}{(n+3)(n+2)^2} \qquad (137)$$

$$= \frac{x+1}{n+2} \left( \frac{n+2}{n+2} - \frac{x+1}{n+2} \right) \frac{1}{n+3} = E[p]\, (1 - E[p])\, \frac{1}{n+3} . \qquad (138)$$

The value of $p$ for which $f(p)$ has the maximum is instead $p_m = x/n$. The
expression $E[p]$ gives the prevision of the probability for the $(n+1)$-th event
occurring and is called the "recursive Laplace formula", or "Laplace's rule of
succession".

When $x$ and $n$ become large, and $0 \ll x \ll n$, $f(p)$ has the following
asymptotic properties:

$$E[p] \approx p_m = \frac{x}{n} , \qquad (139)$$

$$Var(p) \approx \frac{x}{n} \left( 1 - \frac{x}{n} \right) \frac{1}{n} = \frac{p_m (1 - p_m)}{n} , \qquad (140)$$

$$\sigma_p \approx \sqrt{\frac{p_m (1 - p_m)}{n}} , \qquad (141)$$

$$p \sim \mathcal{N}(p_m, \sigma_p) . \qquad (142)$$

Under these conditions the frequentistic "definition" of probability ($x/n$) is
recovered.
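The expected value and variance of the final distribution of $p$ can be evaluated exactly with rational arithmetic. This is a small sketch; the function names are chosen here for illustration.

```python
from fractions import Fraction

def posterior_mean(x, n):
    """E[p] after x successes in n trials, starting from a uniform prior (rule of succession)."""
    return Fraction(x + 1, n + 2)

def posterior_variance(x, n):
    """Var(p) of the final beta distribution with parameters x + 1 and n - x + 1."""
    return Fraction((x + 1) * (n - x + 1), (n + 3) * (n + 2) ** 2)

x, n = 2, 3
mean = posterior_mean(x, n)
print("E[p]    =", mean)                        # 3/5, compared with p_m = x/n = 2/3
print("Var(p)  =", posterior_variance(x, n))    # 1/25
print("check   =", mean * (1 - mean) / (n + 3)) # same value, from E[p](1 - E[p])/(n + 3)
```

Before any trial ($x = 0$, $n = 0$) the same formula gives $E[p] = 1/2$, as expected for a uniform prior.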

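Let us see two particular situations: when $x = 0$ and $x = n$. In these
cases one gives the result as upper or lower limits, respectively. Let us sketch
the solutions:

• $x = n$:

$$f(n \mid \mathcal{B}_{n,p}) = p^n ; \qquad (143)$$

$$f(p \mid x = n, \mathcal{B}) = \frac{p^n}{\int_0^1 p^n\, dp} = (n+1)\, p^n ; \qquad (144)$$

$$F(p \mid x = n, \mathcal{B}) = p^{n+1} . \qquad (145)$$

To get the 95 % lower bound (limit):

$$F(p_0 \mid x = n, \mathcal{B}) = 0.05 , \qquad p_0 = \sqrt[n+1]{0.05} . \qquad (146)$$

An increasing number of trials $n$ constrains $p$ more and more around 1.

• $x = 0$:

$$f(0 \mid \mathcal{B}_{n,p}) = (1-p)^n ; \qquad (147)$$

$$f(p \mid x = 0, n, \mathcal{B}) = \frac{(1-p)^n}{\int_0^1 (1-p)^n\, dp} = (n+1)\, (1-p)^n ; \qquad (148)$$

$$F(p \mid x = 0, n, \mathcal{B}) = 1 - (1-p)^{n+1} . \qquad (149)$$

To get the 95 % upper bound (limit):

$$F(p_0 \mid x = 0, n, \mathcal{B}) = 0.95 ; \qquad p_0 = 1 - \sqrt[n+1]{0.05} . \qquad (150)$$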

The following table shows the 95 % C.L. limits as a function of $n$. The
Poisson approximation, to be discussed in the next section, is also shown.

                     Probability level = 95 %
     n      x = n         x = 0
            binomial      binomial      Poisson approx. (p_0 = 3/n)
     3      p > 0.47      p < 0.53      p < 1
     5      p > 0.61      p < 0.39      p < 0.6
    10      p > 0.76      p < 0.24      p < 0.3
    50      p > 0.94      p < 0.057     p < 0.06
   100      p > 0.97      p < 0.029     p < 0.03
  1000      p > 0.997     p < 0.003     p < 0.003
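The entries of the table follow directly from (146) and (150). As a sketch (function names chosen here for illustration), they can be regenerated as follows:

```python
def lower_limit_all_successes(n, prob=0.95):
    """Lower bound on p at probability level `prob` when x = n, from F(p_0) = 1 - prob."""
    return (1.0 - prob) ** (1.0 / (n + 1))

def upper_limit_no_successes(n, prob=0.95):
    """Upper bound on p at probability level `prob` when x = 0, from F(p_0) = prob."""
    return 1.0 - (1.0 - prob) ** (1.0 / (n + 1))

print("    n    x = n        x = 0        Poisson approx. (3/n)")
for n in (3, 5, 10, 50, 100, 1000):
    print(f"{n:5d}    p > {lower_limit_all_successes(n):.3f}    "
          f"p < {upper_limit_no_successes(n):.3f}    p < {3.0 / n:.3f}")
```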



To show in this simple case how $f(p)$ is updated by the new information,
let us imagine we have performed two experiments. The results are $x_1 = n_1$
and $x_2 = n_2$, respectively. Obviously the global information is equivalent to
$x = x_1 + x_2$ and $n = n_1 + n_2$, with $x = n$. We then get

$$f(p \mid x = n, \mathcal{B}) = (n+1)\, p^n = (n_1 + n_2 + 1)\, p^{n_1 + n_2} . \qquad (151)$$

A different way of proceeding would have been to calculate the final
distribution from the information $x_1 = n_1$

$$f(p \mid x_1 = n_1, \mathcal{B}) = (n_1 + 1)\, p^{n_1} , \qquad (152)$$

and feed it as initial distribution to the next inference:

$$f(p \mid x_1 = n_1, x_2 = n_2, \mathcal{B}) = \frac{p^{n_2}\, f(p \mid x_1 = n_1, \mathcal{B})}{\int_0^1 p^{n_2}\, f(p \mid x_1 = n_1, \mathcal{B})\, dp} \qquad (153)$$

$$= \frac{p^{n_2}\, (n_1 + 1)\, p^{n_1}}{\int_0^1 p^{n_2}\, (n_1 + 1)\, p^{n_1}\, dp} \qquad (154)$$

$$= (n_1 + n_2 + 1)\, p^{n_1 + n_2} , \qquad (155)$$

getting the same result.



12.2 Poisson distributed quantities 

As is well known, the typical application of the Poisson distribution is in
counting experiments such as source activity, cross sections, etc. The
unknown parameter to be inferred is $\lambda$. Applying Bayes' formula we get

$$f(\lambda \mid x, \mathcal{P}) = \frac{\frac{\lambda^x e^{-\lambda}}{x!}\, f_0(\lambda)}{\int_0^\infty \frac{\lambda^x e^{-\lambda}}{x!}\, f_0(\lambda)\, d\lambda} . \qquad (156)$$

Assuming* $f_0(\lambda)$ constant up to a certain $\lambda_{max} \gg x$ and integrating
by parts we obtain

$$f(\lambda \mid x, \mathcal{P}) = \frac{\lambda^x e^{-\lambda}}{x!} , \qquad (157)$$

$$F(\lambda \mid x, \mathcal{P}) = 1 - e^{-\lambda} \left( \sum_{n=0}^{x} \frac{\lambda^n}{n!} \right) , \qquad (158)$$

where the last result has been obtained integrating (157) also by parts. Fig. 9
shows how to build the credibility intervals, given a certain measured number
of counts $x$. Fig. 10 shows some numerical examples.

* There is a school of thought according to which the most appropriate function is
$f(\lambda) \propto 1/\lambda$. If You think that it is reasonable for your problem, it may be a good
prior. Claiming that this is "the Truth" is one of the many claims of the angels' sex
determinations. For didactical purposes a uniform distribution is more than enough.
Some comments about the $1/\lambda$ prescription will be given when discussing the particular
case $x = 0$.

Figure 9: Poisson parameter $\lambda$ inferred from an observed number $x$ of counts.

$f(\lambda)$ has the following properties:

• the expected value, variance and value of maximum probability are

$$E[\lambda] = x + 1 , \qquad (159)$$
$$Var(\lambda) = x + 1 , \qquad (160)$$
$$\lambda_m = x ; \qquad (161)$$

• the fact that the best estimate of $\lambda$ in the Bayesian sense is not the
intuitive value $x$ but $x + 1$ should neither surprise, nor disappoint us:
according to the initial distribution used "there are always more possible
values of $\lambda$ on the right side than on the left side of $x$", and they
pull the distribution to their side; the full information is always given
by $f(\lambda)$ and the use of the mean is just a rough approximation; the
difference from the "desired" intuitive value $x$ in units of the standard
deviation goes as $1/\sqrt{x+1}$ and becomes immediately negligible;

• when $x$ becomes large we get:

$$E[\lambda] \approx \lambda_m = x ; \qquad (162)$$
$$Var(\lambda) \approx \lambda_m = x ; \qquad (163)$$
$$\sigma_\lambda \approx \sqrt{x} ; \qquad (164)$$
$$\lambda \sim \mathcal{N}(x, \sqrt{x}) . \qquad (165)$$

(164) is one of the most familiar formulae used by physicists to assess
the uncertainty of a measurement, although it is sometimes misused.
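The properties of $f(\lambda)$ listed above can be checked by direct numerical integration of (157). The following is a sketch only: the midpoint rule, the integration range and the function name are assumptions made for the example.

```python
import math

def poisson_posterior_moments(x, steps=100_000):
    """Mean and variance of f(lambda | x) = lambda^x e^(-lambda) / x!, by midpoint integration."""
    lam_max = x + 1 + 40.0 * math.sqrt(x + 1)   # generous range: the tail beyond is negligible
    d_lam = lam_max / steps
    m0 = m1 = m2 = 0.0
    for i in range(steps):
        lam = (i + 0.5) * d_lam
        # density evaluated through logarithms to avoid overflow for large x
        w = math.exp(x * math.log(lam) - lam - math.lgamma(x + 1))
        m0 += w
        m1 += lam * w
        m2 += lam * lam * w
    mean = m1 / m0
    return mean, m2 / m0 - mean * mean

for x in (0, 3, 10):
    mean, var = poisson_posterior_moments(x)
    print(f"x = {x:2d}:  E[lambda] = {mean:.4f}   Var(lambda) = {var:.4f}")
```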

Figure 10: Examples of $f(\lambda \mid x_i)$.

Let us conclude with a special case: $x = 0$. As one might imagine, the
inference is highly sensitive to the initial distribution. Let us assume that
the experiment was planned with the hope of observing something, i.e. that
it could detect a handful of events within its lifetime. With this hypothesis
one may use any vague prior function not strongly peaked at zero. We have
already come across a similar case in section 11.3, concerning the upper limit
of the neutrino mass. There it was shown that reasonable hypotheses based
on the positive attitude of the experimentalist are almost equivalent and that
they give results consistent with detector performances. Let us use then the
uniform distribution:

$$f(\lambda \mid x = 0, \mathcal{P}) = e^{-\lambda} , \qquad (166)$$
$$F(\lambda \mid x = 0, \mathcal{P}) = 1 - e^{-\lambda} , \qquad (167)$$
$$\lambda < 3 \ \text{at 95 \% C.L.} \qquad (168)$$
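The limit (168) is the solution of $F(\lambda_0) = 0.95$ with the cumulative function (158). As a sketch (the bisection search, its bracket and the function names are choices made here), the same procedure works for any observed $x$:

```python
import math

def poisson_posterior_cdf(lam, x):
    """F(lambda | x) of (158): final cumulative distribution for a uniform prior."""
    return 1.0 - math.exp(-lam) * sum(lam ** n / math.factorial(n) for n in range(x + 1))

def poisson_upper_limit(x, prob=0.95, hi=100.0):
    """Value lambda_0 with F(lambda_0 | x) = prob, found by bisection on [0, hi]."""
    lo = 0.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if poisson_posterior_cdf(mid, x) < prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

for x in (0, 1, 2):
    print(f"x = {x}:  lambda < {poisson_upper_limit(x):.2f} at 95 %")
```

For $x = 0$ the search returns $-\ln 0.05 \approx 3.0$, which is the value quoted in (168).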



13 Uncertainty due to unknown systematic errors

13.1 Example: uncertainty of the instrument scale offset

Figure 11: Upper limit to $\lambda$ having observed 0 events.

In our scheme any quantity of influence of which we don't know the exact
value is a source of systematic error. It will change the final distribution of $\mu$
and hence its uncertainty. We have already discussed the most general case
in 9.1. Let us make a simple application, making a small variation to the
example in section 11.1: the "zero" of the instrument is not known exactly,
owing to calibration uncertainty. This can be parametrized assuming that
its true value $Z$ is normally distributed around 0 (i.e. the calibration was
properly done!) with a standard deviation $\sigma_Z$. Since, most probably, the
true value of $\mu$ is independent of the true value of $Z$, the initial joint
probability density function can be written as the product of the marginal
ones:



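$$f_0(\mu, z) = f_0(\mu)\, f_0(z) = k\, \frac{1}{\sqrt{2\pi}\,\sigma_Z}\, \exp\left[-\frac{z^2}{2\sigma_Z^2}\right] . \qquad (169)$$

Also the likelihood changes with respect to (114):

$$f(x_1 \mid \mu, z) = \frac{1}{\sqrt{2\pi}\,\sigma_1}\, \exp\left[-\frac{(x_1 - \mu - z)^2}{2\sigma_1^2}\right] . \qquad (170)$$

Putting all the pieces together and making use of (105) we finally get

$$f(\mu \mid x_1, \ldots, f_0(z)) = \frac{\int \frac{1}{\sqrt{2\pi}\,\sigma_1}\, \exp\left[-\frac{(x_1 - \mu - z)^2}{2\sigma_1^2}\right] \frac{1}{\sqrt{2\pi}\,\sigma_Z}\, \exp\left[-\frac{z^2}{2\sigma_Z^2}\right] dz}{\int\!\!\int \frac{1}{\sqrt{2\pi}\,\sigma_1}\, \exp\left[-\frac{(x_1 - \mu - z)^2}{2\sigma_1^2}\right] \frac{1}{\sqrt{2\pi}\,\sigma_Z}\, \exp\left[-\frac{z^2}{2\sigma_Z^2}\right] d\mu\, dz} .$$

Integrating^{16} we get

$$f(\mu) = f(\mu \mid x_1, \ldots, f_0(z)) = \frac{1}{\sqrt{2\pi}\, \sqrt{\sigma_1^2 + \sigma_Z^2}}\, \exp\left[-\frac{(\mu - x_1)^2}{2(\sigma_1^2 + \sigma_Z^2)}\right] . \qquad (171)$$

^{16} It may help to know that

$$\int_{-\infty}^{+\infty} \exp\left[bx - \frac{x^2}{a^2}\right] dx = \sqrt{a^2 \pi}\, \exp\left[\frac{a^2 b^2}{4}\right] .$$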
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The result is that /(//) is still a gaussian, but with a larger variance. The 
global standard uncertainty is the quadratic combination of that due to the 
statistical fluctuation of the data sample and the uncertainty due to the 
imperfect knowledge of the systematic effect: 

crL = '^i + 4 ■ (172) 

This result is well known, although there are still some "old-fashioned" 
recipes which require different combinations of the contributions to be per- 
formed. 
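As a numerical cross-check of (171) and (172), the sketch below marginalizes the offset $z$ with a simple midpoint rule and compares the result with a normal density of standard deviation $\sqrt{\sigma_1^2+\sigma_Z^2}$. It assumes a flat prior on $\mu$ (the constant $k$ of (169)), in which case the denominator of the inference formula equals one. The function names, the integration range and the number of steps are illustrative choices, not part of these notes.

```python
import math

def normal_pdf(x, mean, sigma):
    """Density of a normal distribution with the given mean and standard deviation."""
    return math.exp(-0.5 * ((x - mean) / sigma) ** 2) / (math.sqrt(2 * math.pi) * sigma)

def posterior_mu(mu, x1, sigma1, sigma_z, n_steps=4000, n_sigma=8.0):
    """Integrate f(x1 | mu, z) * f0(z) over z (midpoint rule), flat prior on mu."""
    lo, hi = -n_sigma * sigma_z, n_sigma * sigma_z
    dz = (hi - lo) / n_steps
    total = 0.0
    for k in range(n_steps):
        z = lo + (k + 0.5) * dz
        # likelihood of the reading x1 given true value mu and offset z,
        # times the prior density of the offset
        total += normal_pdf(x1, mu + z, sigma1) * normal_pdf(z, 0.0, sigma_z)
    return total * dz

if __name__ == "__main__":
    x1, sigma1, sigma_z = 10.0, 0.3, 0.4
    sigma_tot = math.hypot(sigma1, sigma_z)   # quadratic combination
    for mu in (9.5, 10.0, 10.7):
        print(mu, posterior_mu(mu, x1, sigma1, sigma_z), normal_pdf(mu, x1, sigma_tot))
```

The two printed columns should agree to within the accuracy of the numerical integration, which illustrates that the offset uncertainty only widens the gaussian without shifting it.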

One has to notice that in this framework it makes no sense to speak of
"statistical" and "systematic" uncertainties, as if they were of a different
nature. They have the same probabilistic nature: $X_1$ is around $\mu$ with a
standard deviation $\sigma_1$, and $Z$ is around 0 with standard deviation $\sigma_Z$. What
distinguishes the two components is how the knowledge of the uncertainty
is gained: in one case ($\sigma_1$) from repeated measurements; in the second case
($\sigma_Z$) the evaluation was done by somebody else (the constructor of the
instrument), or in a previous experiment, or guessed from the knowledge of
the detector, or by simulation, etc. This is the reason why the ISO Guide [2]
prefers the generic names Type A and Type B for the two kinds of contribution
to global uncertainty. In particular the name "systematic uncertainty"
should be avoided, while it is correct to speak about "uncertainty due to a
systematic effect".

13.2 Correction for known systematic errors 

It is easy to be convinced that if our prior knowledge about $Z$ was of the
kind

$$Z \sim \mathcal{N}(z_\circ, \sigma_Z) \tag{173}$$

the result would have been

$$\mu \sim \mathcal{N}\left(x_1 - z_\circ,\ \sqrt{\sigma_1^2 + \sigma_Z^2}\right) , \tag{174}$$

i.e. one has first to correct the result for the best value of the systematic error
and then include in the global uncertainty a term due to imperfect knowledge
about it. This is a well known and practised procedure, although there are
still people who confuse $z_\circ$ with its uncertainty.

13.3 Measuring two quantities with the same instrument having an uncertainty of the scale offset

Let us take an example which is a bit more complicated (at least from the
mathematical point of view) but conceptually very simple and also very common
in laboratory practice. We measure two physical quantities with the
same instrument, assumed to have an uncertainty on the "zero", modeled
with a normal distribution as in the previous sections. For each of the
quantities we collect a sample of data under the same conditions, which means
that the unknown offset error does not change from one set of measurements
to the other. Calling $\mu_1$ and $\mu_2$ the true values, $x_1$ and $x_2$ the sample
averages, $\sigma_1$ and $\sigma_2$ the averages' standard deviations, and $Z$ the true value of
the "zero", the initial probability density and the likelihood are



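$$f_\circ(\mu_1, \mu_2, z) = f_\circ(\mu_1)\, f_\circ(\mu_2)\, f_\circ(z) = k\, \frac{1}{\sqrt{2\pi}\,\sigma_Z} \exp\left[-\frac{z^2}{2\sigma_Z^2}\right] \tag{175}$$

and

$$f(x_1, x_2 \mid \mu_1, \mu_2, z) = \frac{1}{\sqrt{2\pi}\,\sigma_1} \exp\left[-\frac{(x_1 - \mu_1 - z)^2}{2\sigma_1^2}\right] \cdot \frac{1}{\sqrt{2\pi}\,\sigma_2} \exp\left[-\frac{(x_2 - \mu_2 - z)^2}{2\sigma_2^2}\right]$$
$$= \frac{1}{2\pi\,\sigma_1 \sigma_2} \exp\left[-\frac{1}{2}\left(\frac{(x_1 - \mu_1 - z)^2}{\sigma_1^2} + \frac{(x_2 - \mu_2 - z)^2}{\sigma_2^2}\right)\right] \tag{176}$$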



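respectively. The result of the inference is now the joint probability density
function of $\mu_1$ and $\mu_2$:

$$f(\mu_1, \mu_2 \mid x_1, x_2, \sigma_1, \sigma_2, f_\circ(z)) = \frac{\int f(x_1, x_2 \mid \mu_1, \mu_2, z)\, f_\circ(\mu_1, \mu_2, z)\, dz}{\int\!\!\int\!\!\int f(x_1, x_2 \mid \mu_1, \mu_2, z)\, f_\circ(\mu_1, \mu_2, z)\, d\mu_1\, d\mu_2\, dz} , \tag{177}$$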



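where expansion of the functions has been omitted for the sake of clarity.
Integrating we get

$$f(\mu_1, \mu_2) = \frac{1}{2\pi \sqrt{\sigma_1^2+\sigma_Z^2}\, \sqrt{\sigma_2^2+\sigma_Z^2}\, \sqrt{1-\rho^2}} \exp\left\{ -\frac{1}{2(1-\rho^2)} \left[ \frac{(\mu_1 - x_1)^2}{\sigma_1^2+\sigma_Z^2} - 2\rho\, \frac{(\mu_1 - x_1)(\mu_2 - x_2)}{\sqrt{\sigma_1^2+\sigma_Z^2}\, \sqrt{\sigma_2^2+\sigma_Z^2}} + \frac{(\mu_2 - x_2)^2}{\sigma_2^2+\sigma_Z^2} \right] \right\} , \tag{178}$$

where

$$\rho = \frac{\sigma_Z^2}{\sqrt{\sigma_1^2+\sigma_Z^2}\, \sqrt{\sigma_2^2+\sigma_Z^2}} . \tag{179}$$

If $\sigma_Z$ vanishes then (178) has the simpler expression

$$f(\mu_1, \mu_2) \xrightarrow[\sigma_Z \to 0]{} \frac{1}{\sqrt{2\pi}\,\sigma_1} \exp\left[-\frac{(\mu_1 - x_1)^2}{2\sigma_1^2}\right] \cdot \frac{1}{\sqrt{2\pi}\,\sigma_2} \exp\left[-\frac{(\mu_2 - x_2)^2}{2\sigma_2^2}\right] , \tag{180}$$

i.e. if there is no uncertainty on the offset calibration then the joint density
function $f(\mu_1, \mu_2)$ is equal to the product of two independent normal
functions, i.e. $\mu_1$ and $\mu_2$ are independent. In the general case we have to conclude
that: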

• the effect of the common uncertainty $\sigma_Z$ makes the two values correlated,
since they are affected by a common unknown systematic error; the
correlation coefficient is always non negative ($\rho \ge 0$), as intuitively
expected from the definition of systematic error;

• the joint density function is a multinormal distribution of parameters
$x_1$, $\sigma_{\mu_1} = \sqrt{\sigma_1^2+\sigma_Z^2}$, $x_2$, $\sigma_{\mu_2} = \sqrt{\sigma_2^2+\sigma_Z^2}$, and $\rho$ (see example of
Fig. 5);

• the marginal distributions are still normal:

$$\mu_1 \sim \mathcal{N}\left(x_1,\ \sqrt{\sigma_1^2+\sigma_Z^2}\right) \tag{181}$$
$$\mu_2 \sim \mathcal{N}\left(x_2,\ \sqrt{\sigma_2^2+\sigma_Z^2}\right) ; \tag{182}$$

• the covariance between $\mu_1$ and $\mu_2$ is

$$\mathrm{Cov}(\mu_1, \mu_2) = \rho\, \sigma_{\mu_1} \sigma_{\mu_2} = \sigma_Z^2 ; \tag{183}$$

• the distribution of any function $g(\mu_1, \mu_2)$ can be calculated using the
standard methods of probability theory. For example, one can demonstrate
that the sum $S = \mu_1 + \mu_2$ and the difference $D = \mu_1 - \mu_2$ are
also normally distributed (see also the introductory discussion to the
central limit theorem and section 16 for the calculation of averages and
standard deviations):

$$S \sim \mathcal{N}\left(x_1 + x_2,\ \sqrt{\sigma_1^2 + \sigma_2^2 + (2\sigma_Z)^2}\right) \tag{184}$$
$$D \sim \mathcal{N}\left(x_1 - x_2,\ \sqrt{\sigma_1^2 + \sigma_2^2}\right) . \tag{185}$$

The result can be interpreted in the following way:

- the uncertainty on the difference does not depend on the common
offset uncertainty: whatever the value of the true "zero" is, it
cancels in differences;

- in the sum, instead, the effect of the common uncertainty is somewhat
amplified since it enters "in phase" in the global uncertainty
of each of the quantities.
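The conclusions listed above reduce to a few lines of arithmetic. The sketch below, with illustrative function and key names that are not part of these notes, evaluates the standard uncertainties of $\mu_1$ and $\mu_2$, their correlation coefficient, and the uncertainties of the sum and of the difference, assuming $\sigma_1$, $\sigma_2$ and $\sigma_Z$ are all positive.

```python
import math

def common_offset_summary(s1, s2, sz):
    """Uncertainties for two quantities measured with a shared offset.

    s1, s2: standard deviations of the two sample averages
    sz:     standard uncertainty of the common "zero"
    """
    sig_mu1 = math.hypot(s1, sz)            # marginal of mu1, as in (181)
    sig_mu2 = math.hypot(s2, sz)            # marginal of mu2, as in (182)
    cov = sz ** 2                           # covariance, as in (183)
    rho = cov / (sig_mu1 * sig_mu2)         # correlation coefficient
    # Var(S) and Var(D) from the usual rule Var(a +/- b) = Var(a) + Var(b) +/- 2 Cov(a, b)
    sig_sum = math.sqrt(sig_mu1 ** 2 + sig_mu2 ** 2 + 2 * cov)
    sig_diff = math.sqrt(sig_mu1 ** 2 + sig_mu2 ** 2 - 2 * cov)
    return {"sigma_mu1": sig_mu1, "sigma_mu2": sig_mu2, "rho": rho,
            "sigma_sum": sig_sum, "sigma_diff": sig_diff}

if __name__ == "__main__":
    print(common_offset_summary(0.3, 0.4, 0.5))
```

With these inputs the difference has the same uncertainty it would have without any offset, $\sqrt{0.3^2 + 0.4^2}$, while the sum picks up the offset contribution twice.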



13.4 Indirect calibration 

Let us use the result of the previous section to solve another typical problem
of measurements. Suppose that after (or before, it doesn't matter) we have
done the measurements of $x_1$ and $x_2$ and we have the final result, summarized
in (178), we know the "exact" value of $\mu_1$ (for example we perform the
measurement on a reference). Let us call it $\mu_1^\circ$. Will this information provide
a better knowledge of $\mu_2$? In principle yes: the difference between $x_1$ and
$\mu_1^\circ$ defines the systematic error (the true value of the "zero" $Z$). This error
can then be subtracted from $\mu_2$ to get a corrected value. Also the overall
uncertainty of $\mu_2$ should change; intuitively it "should" decrease, since we
are adding new information. But its value doesn't seem to be obvious, since
the logical link between $\mu_1^\circ$ and $\mu_2$ is $\mu_1^\circ \rightarrow Z \rightarrow \mu_2$.

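The problem can be solved exactly using the concept of conditional probability
density function $f(\mu_2 \mid \mu_1)$ (see (93-94)). We get

$$\mu_2 \mid \mu_1^\circ \sim \mathcal{N}\left(x_2 + \frac{\sigma_Z^2}{\sigma_1^2+\sigma_Z^2}\,(\mu_1^\circ - x_1),\ \sqrt{\sigma_2^2 + \left(\frac{1}{\sigma_1^2} + \frac{1}{\sigma_Z^2}\right)^{-1}}\right) . \tag{186}$$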



The best value of $\mu_2$ is shifted by an amount $\Delta$, with respect to the measured
value $x_2$, which is not exactly $x_1 - \mu_1^\circ$, as was naively guessed, and the
uncertainty depends on $\sigma_2$, $\sigma_Z$ and $\sigma_1$. It is easy to be convinced that the
exact result is more reasonable than the (suggested) first guess. Let us rewrite
$\Delta$ in two different ways:

$$\Delta = \frac{\sigma_Z^2}{\sigma_1^2+\sigma_Z^2}\,(\mu_1^\circ - x_1) \tag{187}$$
$$\phantom{\Delta} = -\,\frac{\frac{1}{\sigma_1^2}\,(x_1 - \mu_1^\circ) + \frac{1}{\sigma_Z^2}\cdot 0}{\frac{1}{\sigma_1^2} + \frac{1}{\sigma_Z^2}} . \tag{188}$$

• Eq. (187) shows that one has to apply the correction $x_1 - \mu_1^\circ$ only if
$\sigma_1 = 0$. If instead $\sigma_Z = 0$ there is no correction to be applied, since the
instrument is perfectly calibrated. If $\sigma_1 \approx \sigma_Z$ the correction is half of
the measured difference between $x_1$ and $\mu_1^\circ$.

• Eq. (188) shows explicitly what is going on and why the result is
consistent with the way we have modeled the uncertainties. In fact
we have performed two independent calibrations: one of the offset and
one of $\mu_1$. The best estimate of the true value of the "zero" $Z$ is the
weighted average of the two measured offsets;

• the new uncertainty of $\mu_2$ (see (186)) is a combination of $\sigma_2$ and the
uncertainty of the weighted average of the two offsets. Its value is smaller
than what one would have with only one calibration and, obviously,
larger than that due to the sampling fluctuations alone:

$$\sigma_2 \le \sqrt{\sigma_2^2 + \frac{\sigma_1^2\, \sigma_Z^2}{\sigma_1^2 + \sigma_Z^2}} \le \sqrt{\sigma_2^2 + \sigma_Z^2} . \tag{189}$$
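The indirect calibration can be sketched in a few lines. The function below is a minimal illustration (its name and argument names are assumptions, not part of these notes): it applies the shift written in (187) and the variance appearing in (186), written in a form that also works when one of $\sigma_1$, $\sigma_Z$ is zero (but not both).

```python
import math

def indirect_calibration(x1, s1, x2, s2, sz, mu1_ref):
    """Best value and standard uncertainty of mu2 once mu1 is known to be mu1_ref."""
    weight = sz ** 2 / (s1 ** 2 + sz ** 2)
    shift = weight * (mu1_ref - x1)                         # Delta of (187)
    # variance of the weighted average of the two offset determinations,
    # equal to 1 / (1/s1^2 + 1/sz^2) when both are non-zero
    var_offset = (s1 ** 2 * sz ** 2) / (s1 ** 2 + sz ** 2)
    return x2 + shift, math.sqrt(s2 ** 2 + var_offset)

if __name__ == "__main__":
    # equal weights: half of the measured difference is applied as a correction
    print(indirect_calibration(x1=10.2, s1=0.1, x2=5.0, s2=0.05, sz=0.1, mu1_ref=10.0))
    # perfectly measured reference (s1 = 0): the full difference is applied
    print(indirect_calibration(x1=10.2, s1=0.0, x2=5.0, s2=0.05, sz=0.1, mu1_ref=10.0))
```

In the first call the resulting uncertainty lies between $\sigma_2$ and $\sqrt{\sigma_2^2+\sigma_Z^2}$, as required by (189).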

13.5 Counting measurements in presence of background 

As an example of a different kind of systematic effect, let us think of counting
experiments in the presence of background. For example we are searching
for a new particle, we make some selection cuts and count $x$ events. But
we also expect an average number of background events $\lambda_{B_\circ} \pm \sigma_B$, where $\sigma_B$
is the standard uncertainty of $\lambda_{B_\circ}$, not to be confused with $\sqrt{\lambda_{B_\circ}}$. What
can we say about $\lambda_S$, the true value of the average number associated to
the signal? First we will treat the case in which the expected number of
background events is well known ($\sigma_B/\lambda_{B_\circ} \ll 1$), and
then the general case:

$\sigma_B/\lambda_{B_\circ} \ll 1$: the true value of the sum of signal and background is
$\lambda = \lambda_S + \lambda_{B_\circ}$. The likelihood is

$$P(x \mid \lambda) = \frac{e^{-\lambda}\, \lambda^x}{x!} . \tag{190}$$

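Applying Bayes' theorem we have

$$f(\lambda_S \mid x, \lambda_{B_\circ}) = \frac{e^{-(\lambda_{B_\circ}+\lambda_S)}\, (\lambda_{B_\circ}+\lambda_S)^x\, f_\circ(\lambda_S)}{\int_0^\infty e^{-(\lambda_{B_\circ}+\lambda_S)}\, (\lambda_{B_\circ}+\lambda_S)^x\, f_\circ(\lambda_S)\, d\lambda_S} . \tag{191}$$

Choosing again $f_\circ(\lambda_S)$ uniform (in a reasonable interval) this gets
simplified. The integral in the denominator can be done easily by parts
and the final result is:

$$f(\lambda_S \mid x, \lambda_{B_\circ}) = \frac{e^{-\lambda_S}\, (\lambda_{B_\circ}+\lambda_S)^x}{x!\, \sum_{n=0}^{x} \frac{\lambda_{B_\circ}^n}{n!}} \tag{192}$$

$$F(\lambda_S \mid x, \lambda_{B_\circ}) = 1 - \frac{e^{-\lambda_S}\, \sum_{n=0}^{x} \frac{(\lambda_{B_\circ}+\lambda_S)^n}{n!}}{\sum_{n=0}^{x} \frac{\lambda_{B_\circ}^n}{n!}} . \tag{193}$$

From (192-193) it is possible to calculate in the usual way the best
estimate and the credibility intervals of $\lambda_S$. Two particular cases are
of interest:

• if $\lambda_{B_\circ} = 0$ then formulae (157-158) are recovered. In such a case
one measured count is enough to claim for a signal (if somebody is
willing to believe that really $\lambda_{B_\circ} = 0$ without any uncertainty...);

• if $x = 0$ then

$$f(\lambda_S \mid x, \lambda_{B_\circ}) = e^{-\lambda_S} \tag{194}$$

independently of $\lambda_{B_\circ}$. This result is not really obvious.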



Any $g(\lambda_{B_\circ})$: In the general case, the true value of the average number of
background events $\lambda_B$ is unknown. We only know that it is distributed
around $\lambda_{B_\circ}$ with standard deviation $\sigma_B$ and probability density function
$g(\lambda_B)$, not necessarily a gaussian. What changes with respect to the
previous case is the initial distribution, now a joint function of $\lambda_S$ and
of $\lambda_B$. Assuming $\lambda_B$ and $\lambda_S$ independent, the prior density function is

$$f_\circ(\lambda_S, \lambda_B) = f_\circ(\lambda_S)\, g_\circ(\lambda_B) . \tag{195}$$

We leave $f_\circ$ in the form of a joint distribution to indicate that the result
we shall get is the most general for this kind of problem. The likelihood,
on the other hand, remains the same as in the previous example. The
inference of $\lambda_S$ is done in the usual way, applying Bayes' theorem and
marginalizing with respect to $\lambda_B$:

$$f(\lambda_S \mid x, g_\circ(\lambda_B)) = \frac{\int e^{-(\lambda_B+\lambda_S)}\, (\lambda_B+\lambda_S)^x\, f_\circ(\lambda_S, \lambda_B)\, d\lambda_B}{\int\!\!\int e^{-(\lambda_B+\lambda_S)}\, (\lambda_B+\lambda_S)^x\, f_\circ(\lambda_S, \lambda_B)\, d\lambda_S\, d\lambda_B} . \tag{196}$$

The previous case (formula (192)) is recovered if the only value allowed
for $\lambda_B$ is $\lambda_{B_\circ}$ and $f_\circ(\lambda_S)$ is uniform:

$$f_\circ(\lambda_S, \lambda_B) = k\, \delta(\lambda_B - \lambda_{B_\circ}) . \tag{197}$$
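The formulae of this section are easy to evaluate numerically. The following sketch assumes a uniform prior on $\lambda_S$ throughout; the function names are illustrative, and the last function represents $g_\circ(\lambda_B)$ by a discrete grid of values and weights, which is a simplifying assumption made here and not something prescribed by these notes.

```python
import math

def partial_exp_sum(lam, x):
    """Truncated exponential series: sum over n = 0..x of lam**n / n!."""
    term, total = 1.0, 1.0
    for n in range(1, x + 1):
        term *= lam / n
        total += term
    return total

def signal_pdf(lam_s, x, lam_b):
    """Posterior density (192) for x observed counts and a well known background lam_b."""
    return (math.exp(-lam_s) * (lam_b + lam_s) ** x
            / (math.factorial(x) * partial_exp_sum(lam_b, x)))

def signal_cdf(lam_s, x, lam_b):
    """Cumulative distribution (193)."""
    return 1.0 - (math.exp(-lam_s) * partial_exp_sum(lam_b + lam_s, x)
                  / partial_exp_sum(lam_b, x))

def upper_limit(x, lam_b, prob=0.95):
    """Value of lam_s below which the posterior probability is `prob` (bisection)."""
    lo, hi = 0.0, 1.0
    while signal_cdf(hi, x, lam_b) < prob:   # enlarge the bracket until it contains the root
        hi *= 2.0
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if signal_cdf(mid, x, lam_b) < prob:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def signal_pdf_uncertain_background(lam_s, x, lam_b_grid, weights):
    """Discretized form of (196): g(lam_b) is given as weights on a grid of lam_b values."""
    num = sum(w * math.exp(-(lb + lam_s)) * (lb + lam_s) ** x
              for lb, w in zip(lam_b_grid, weights))
    # the integral over lam_s of the Poisson term is x! e^{-lb} sum_{n<=x} lb^n / n!
    den = sum(w * math.factorial(x) * math.exp(-lb) * partial_exp_sum(lb, x)
              for lb, w in zip(lam_b_grid, weights))
    return num / den

if __name__ == "__main__":
    print(upper_limit(0, 0.0), upper_limit(0, 3.7))   # x = 0: independent of the background
    print(upper_limit(5, 2.0))
```

When the grid contains a single point the last function reduces to `signal_pdf`, which mirrors the way (192) is recovered from (196) through (197).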



14 Approximate methods 
14.1 Linearization 

We have seen in the above examples how to use the general formula (105)
for practical applications. Unfortunately, when the problem becomes more
complicated one starts facing integration problems. For this reason approximate
methods are generally used. We will derive the approximation rules
consistently with the approach followed in these notes and then the resulting
formulae will be compared with the ISO recommendations. To do this, let us
neglect for a while all quantities of influence which could produce unknown
systematic errors. In this case (105) can be replaced by (106), which can
be further simplified if we remember that correlations between the results
originate from unknown systematic errors. In the absence of these, the joint
distribution of all quantities $\mu$ is simply the product of the marginal ones:

$$f_R(\mu) = \prod_i f_{R_i}(\mu_i) , \tag{198}$$

with

$$f_{R_i}(\mu_i) = f_{R_i}(\mu_i \mid x_i, h_\circ) . \tag{199}$$

The symbol $f_{R_i}(\mu_i)$ indicates that we are dealing with raw values$^{17}$ evaluated
at $h = h_\circ$. Since for any variation of $h$ the inferred values of $\mu_i$ will change,
it is convenient to name with the same subscript $R$ the quantity obtained
for $h_\circ$:

$$f_{R_i}(\mu_i) \longrightarrow f_{R_i}(\mu_{R_i}) . \tag{200}$$

Let us indicate with $\hat{\mu}_{R_i}$ and $\sigma_{R_i}$ the best estimates and the standard
uncertainty of the raw values:

$$\hat{\mu}_{R_i} = E[\mu_{R_i}] \tag{201}$$
$$\sigma_{R_i}^2 = \mathrm{Var}(\mu_{R_i}) . \tag{202}$$

For any possible configuration of conditioning hypotheses $h$, corrected values
$\mu_i$ are obtained:

$$\mu_i = \mu_{R_i} + g_i(h) . \tag{203}$$

The function which relates the corrected value to the raw value and to the
systematic effects has been denoted by $g$, not to be confused with a probability
density function. Expanding (203) in series around $h_\circ$ we finally arrive
at the expression which will allow us to make the approximated evaluations
of uncertainties:

$$\mu_i = \mu_{R_i} + \sum_l \frac{\partial g_i}{\partial h_l}\,(h_l - h_{\circ_l}) + \ldots \tag{204}$$

(All derivatives are evaluated at $\{\hat{\mu}_{R_i}, h_\circ\}$. To simplify the notation a similar
convention will be used in the following formulae.)

$^{17}$ The choice of the adjective "raw" will become clearer later on.

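Neglecting the terms of the expansion above the first order, and taking
the expected values, we get

$$\hat{\mu}_i = E[\mu_i] \approx \hat{\mu}_{R_i} ; \tag{205}$$

$$\sigma_{\mu_i}^2 = E\left[(\mu_i - E[\mu_i])^2\right] \approx \sigma_{R_i}^2 + \sum_l \left(\frac{\partial g_i}{\partial h_l}\right)^2 \sigma_{h_l}^2 + \left\{ 2 \sum_{l<m} \frac{\partial g_i}{\partial h_l}\, \frac{\partial g_i}{\partial h_m}\, \rho_{lm}\, \sigma_{h_l} \sigma_{h_m} \right\} ; \tag{206}$$

$$\mathrm{Cov}(\mu_i, \mu_j) = E\left[(\mu_i - E[\mu_i])(\mu_j - E[\mu_j])\right] \approx \sum_l \frac{\partial g_i}{\partial h_l}\, \frac{\partial g_j}{\partial h_l}\, \sigma_{h_l}^2 + \left\{ \sum_{l \neq m} \frac{\partial g_i}{\partial h_l}\, \frac{\partial g_j}{\partial h_m}\, \rho_{lm}\, \sigma_{h_l} \sigma_{h_m} \right\} . \tag{207}$$

The terms included within $\{\cdot\}$ vanish if the unknown systematic errors are
uncorrelated, and the formulae become simpler. Unfortunately, very often
this is not the case, as when several calibration constants are simultaneously
obtained from a fit (for example, in most linear fits slope and intercept have
a correlation coefficient close to $-0.9$).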

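Sometimes the expansion (204) is not performed around the best values
of $h$ but around their nominal values, in the sense that the correction for the
known value of the systematic errors has not yet been applied (see section
13.2). In this case (204) should be replaced by

$$\mu_i = \mu_{R_i} + \sum_l \frac{\partial g_i}{\partial h_l}\,(h_l - h_{N_l}) + \ldots \tag{208}$$

where the subscript $N$ stands for nominal. The best value of $\mu_i$ is then

$$\hat{\mu}_i = E[\mu_i] \approx \hat{\mu}_{R_i} + \sum_l E\left[\frac{\partial g_i}{\partial h_l}\,(h_l - h_{N_l})\right] = \hat{\mu}_{R_i} + \sum_l \delta\hat{\mu}_{il} . \tag{209}$$

(206) and (207) instead remain valid, with the condition that the derivative
is calculated at $h_N$. If $\rho_{lm} = 0$ it is possible to rewrite (206) and (207) in the
following way, which is very convenient for practical applications:

$$\sigma_{\mu_i}^2 = \sigma_{R_i}^2 + \sum_l \left(\frac{\partial g_i}{\partial h_l}\right)^2 \sigma_{h_l}^2 \tag{210}$$
$$\phantom{\sigma_{\mu_i}^2} = \sigma_{R_i}^2 + \sum_l u_{il}^2 \tag{211}$$
$$\mathrm{Cov}(\mu_i, \mu_j) = \sum_l \frac{\partial g_i}{\partial h_l}\, \frac{\partial g_j}{\partial h_l}\, \sigma_{h_l}^2 \tag{212}$$
$$\phantom{\mathrm{Cov}(\mu_i, \mu_j)} = \sum_l s_{ijl} \left|\frac{\partial g_i}{\partial h_l}\right| \sigma_{h_l} \left|\frac{\partial g_j}{\partial h_l}\right| \sigma_{h_l} \tag{213}$$
$$\phantom{\mathrm{Cov}(\mu_i, \mu_j)} = \sum_l s_{ijl}\, u_{il}\, u_{jl} \tag{214}$$
$$\phantom{\mathrm{Cov}(\mu_i, \mu_j)} = \sum_l \mathrm{Cov}_l(\mu_i, \mu_j) . \tag{215}$$

$u_{il} = \left|\partial g_i / \partial h_l\right| \sigma_{h_l}$ is the component of the standard uncertainty due to effect $h_l$. $s_{ijl}$ is equal
to the product of signs of the derivatives, which takes into account whether
the uncertainties are positively or negatively correlated.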

To summarize, when systematic effects are not correlated with each other,
the following quantities are needed to evaluate the corrected result, the combined
uncertainties and the correlations:

• the raw $\hat{\mu}_{R_i}$ and $\sigma_{R_i}$;

• the best estimates of the corrections $\delta\hat{\mu}_{il}$ for each systematic effect $h_l$;

• the best estimate of the standard deviation $u_{il}$ due to the imperfect
knowledge of the systematic effect;

• for any pair $\{\mu_i, \mu_j\}$ the sign of the correlation $s_{ijl}$ due to the effect $h_l$.
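The bookkeeping summarized above can be sketched as follows. This is a minimal illustration under the stated assumption of mutually uncorrelated effects; the function name and the convention of storing each $u_{il}$ with the sign of the corresponding derivative (so that the product of two signed entries already contains $s_{ijl}$) are choices made here, and the numbers in the usage example are invented.

```python
import math

def combine_uncorrelated_effects(raw, sigma_raw, corrections, signed_u):
    """Corrected values, standard uncertainties and covariance matrix.

    raw[i], sigma_raw[i]: raw best estimate and its standard uncertainty
    corrections[i][l]:    best estimate of the correction to quantity i from effect l
    signed_u[i][l]:       u_il carrying the sign of the derivative dg_i/dh_l
    """
    n = len(raw)
    best = [raw[i] + sum(corrections[i]) for i in range(n)]               # as in (209)
    sigma = [math.sqrt(sigma_raw[i] ** 2 + sum(u * u for u in signed_u[i]))
             for i in range(n)]                                           # as in (211)
    cov = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                cov[i][j] = sigma[i] ** 2
            else:
                # sum over effects of s_ijl * u_il * u_jl, as in (214)
                cov[i][j] = sum(ui * uj for ui, uj in zip(signed_u[i], signed_u[j]))
    return best, sigma, cov

if __name__ == "__main__":
    best, sigma, cov = combine_uncorrelated_effects(
        raw=[1.50, 1.80], sigma_raw=[0.07, 0.08],
        corrections=[[-0.05, 0.0], [0.06, 0.04]],
        signed_u=[[0.10, 0.0], [0.10, -0.05]])
    print(best, sigma, cov)
```

An effect that acts on only one quantity contributes to its variance but not to the covariance, as the second (hypothetical) effect in the usage example shows.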

In High Energy Physics applications it is frequently the case that the
derivatives appearing in (209-213) cannot be calculated directly, as for example
when $h_l$ are parameters of a simulation program, or acceptance cuts.
Then variations of $\mu_i$ are usually studied by varying a particular $h_l$ within a
reasonable interval, holding the other influence quantities at the nominal value.
$\delta\mu_{il}$ and $u_{il}$ are calculated from the interval $\pm\Delta_i^{\pm}$ of variation of the true
value for a given variation $\pm\Delta_{h_l}^{\pm}$ of $h_l$ and from the probabilistic meaning of
the intervals (i.e. from the assumed distribution of the true value). This empirical
procedure for determining $\delta\mu_{il}$ and $u_{il}$ has the advantage that it can
take into account non linear effects, since it directly measures the difference
$\hat{\mu}_i - \hat{\mu}_{R_i}$ for a given difference $h_l - h_{N_l}$.

Some examples are given in section 14.4, and two typical experimental
applications will be discussed in more detail in section 16.

14.2 BIPM and ISO recommendations 

In this section we compare the results obtained in the previous section 
with the recommendations of the Bureau International des Poids et Mesures 
(BIPM) and the International Organization for Standardization (ISO) on 
"the expression of experimental uncertainty" . 

1 . The uncertainty in the result of a measurement generally consists 

of several components which may be grouped into two categories 
according to the way in which their numerical value is estimated: 

A: those which are evaluated by statistical methods; 
B: those which are evaluated by other means. 

There is not always a simple correspondence between the classification
into categories A or B and the previously used classification
into "random" and "systematic" uncertainties. The term "systematic
uncertainty" can be misleading and should be avoided.

The detailed report of the uncertainty should consist of a com- 
plete list of the components, specifying for each the method used 
to obtain its numerical result. 

Essentially the first recommendation states that all uncertainties can 
be treated probabilistically. The distinction between types A and B 
is subtle and can be misleading if one thinks of "statistical methods" 
as synonymous with "probabilistic methods" - as currently done in 
High Energy Physics. Here "statistical" has the classical meaning of 
repeated measurements. 

2. The components in category A are characterized by the estimated
variances $s_i^2$ (or the estimated "standard deviations" $s_i$) and the
number of degrees of freedom $\nu_i$. Where appropriate, the covariances
should be given.

The estimated variances correspond to $\sigma_{R_i}^2$ of the previous section. The
degrees of freedom are related to small samples and to the Student
t distribution. The problem of small samples is not discussed in
these notes, but clearly this recommendation is a relic of frequentistic
methods. With the approach followed in these notes there is no need
to talk about degrees of freedom, since the Bayesian inference defines
the final probability function $f(\mu)$ completely.

3. The components in category B should be characterized by quantities
$u_j^2$, which may be considered as approximations to the corresponding
variances, the existence of which is assumed. The
quantities $u_j^2$ may be treated like variances and the quantities
$u_j$ like standard deviations. Where appropriate, the covariances
should be treated in a similar way.

Clearly, this recommendation is meaningful only in a Bayesian frame- 
work. 

4. The combined uncertainty should be characterized by the numeri- 
cal value obtained by applying the usual method for the combina- 
tion of variances. The combined uncertainty and its components 
should be expressed in the form of "standard deviations". 

This is what we have found in (206-207). 

5. If, for particular applications, it is necessary to multiply the combined
uncertainty by a factor to obtain an overall uncertainty, the
multiplying factor used must always be stated.

This last recommendation states once more that the uncertainty is "by
default" the standard deviation of the true value distribution. Any
other quantity calculated to obtain a credibility interval with a certain
probability level should be clearly stated.

To summarize, these are the basic ingredients of the BIPM/ISO recommen- 
dations: 

subjective definition of probability: it allows variances to be assigned 
conceptually to any physical quantity which has an uncertain value; 

uncertainty as standard deviation 

• it is "standard"; 

• the rule of combination (85-88) applies to standard deviations and 
not to confidence intervals; 

combined standard uncertainty: it is obtained by the usual formula of
"error propagation" and it makes use of variances, covariances and
first derivatives;

central limit theorem: it makes, under proper conditions, the true value
normally distributed if one has several sources of uncertainty.

Consultation of the Guide [2] is recommended for further explanations
about the justification of the norms, for the description of evaluation procedures,
as well as for examples. I would just like to end this section with some
examples of the evaluation of type B uncertainties and with some words of
caution concerning the use of approximations and of linearization.

14.3 Evaluation of type B uncertainties 

The ISO Guide states that 

For estimate $x_i$ of an input quantity$^{18}$ $X_i$ that has not been obtained
from repeated observations, the . . . standard uncertainty $u_i$ is evaluated
by scientific judgement based on all the available information on
the possible variability of $X_i$. The pool of information may include

• previous measurement data; 

• experience with or general knowledge of the behaviour and prop- 
erties of relevant materials and instruments; 

• manufacturer's specifications; 

• data provided in calibration and other certificates; 

• uncertainties assigned to reference data taken from handbooks. 



$^{18}$ By "input quantity" the ISO Guide means any of the contributions $h_l$ or $\mu_{R_i}$ which
enter into (206-207).

14.4 Examples of type B uncertainties

1. A manufacturer's calibration certificate states that the uncertainty, defined
as $k$ standard deviations, is "$\pm\Delta$":

$$u = \frac{\Delta}{k} .$$

2. A result is reported in a publication as $\overline{x} \pm \Delta$, stating that the average
has been performed on 4 measurements and the uncertainty is a 95 %
confidence interval. One has to conclude that the confidence interval
has been calculated using the Student t:

$$u = \frac{\Delta}{3.18} .$$

3. A manufacturer's specification states that the error on a quantity should
not exceed $\Delta$. With this limited information one has to assume a
uniform distribution:

$$u = \frac{2\Delta}{\sqrt{12}} = \frac{\Delta}{\sqrt{3}} .$$

4. A physical parameter of a Monte Carlo is believed to lie in the interval of
$\pm\Delta$ around its best value, but not with uniform distribution: the probability
that the parameter is at the center is higher than that it is at the
edges of the interval. With this information a triangular distribution
can be reasonably assumed:

$$u = \frac{\Delta}{\sqrt{6}} .$$

Note that the coefficient in front of $\Delta$ changes from the 0.58 of the previous
example to the 0.41 of this. If the interval $\pm\Delta$ were a $3\sigma$ interval
then the coefficient would have been equal to 0.33. These variations,
to be considered extreme, are smaller than the statistical fluctuations
of empirical standard deviations estimated from $\approx 10$ measurements.
This shows that one should not be worried that the type B uncertainties
are less accurate than type A, especially if one tries to model the
distribution of the physical quantity honestly.

5. The absolute energy calibration of an electromagnetic calorimeter module
is not exactly known and it is estimated to be between the nominal
one and +10%. The "statistical" resolution is known by test beam
measurements to be $18\%/\sqrt{E/\mathrm{GeV}}$. What is the uncertainty on the
energy measurement of an electron which has apparently released 30 GeV?

• The energy has to be corrected for the best estimate of the calibration
constant, +5 %:

$$E_R = 31.5 \pm 1.0\ \mathrm{GeV} .$$

- assuming a uniform distribution of the true calibration constant:
$u = 31.5 \times 0.1/\sqrt{12} = 0.9$ GeV:

$$E = 31.5 \pm 1.3\ \mathrm{GeV} ;$$

- assuming a triangular distribution: $u = 31.5 \times 0.05/\sqrt{6} = 0.6$ GeV:

$$E = 31.5 \pm 1.2\ \mathrm{GeV} ;$$

• interpreting the maximum deviation from the nominal calibration
as uncertainty (see comment at the end of section 13.2):

$$E = 30.0 \pm 1.0 \pm 3.0\ \mathrm{GeV} \rightarrow E = 30.0 \pm 3.2\ \mathrm{GeV} ;$$

As already remarked earlier in these notes, while reasonable assumptions
(in this case the first two) give consistent results, this
is not true if one makes inconsistent use of the information just
for the sake of giving "safe" uncertainties.

Table 3: Example of the result of two physical quantities corrected by several
systematic effects (arbitrary units). [Rows for the individual influence quantities not reproduced.]

    $\mu_1 = 1.45 \pm 0.14$            ($\mu_1 = 1.45 \pm 0.10 \pm 0.10$)
    $\mu_2 = 1.90 \pm 0.15$            ($\mu_2 = 1.90 \pm 0.11 \pm 0.10$)
                                       ($\rho = +0.49$)
    $\mu_2 + \mu_1 = 3.35 \pm 0.25$
    $\mu_2 - \mu_1 = 0.45 \pm 0.15$
    $\overline{\mu} = 1.65 \pm 0.12$

6. As a more realistic and slightly more complicated example, let us take
the case of a measurement of two physical quantities performed with the
same apparatus. The result, before the correction for systematic effects
and only with type A uncertainties, is $\hat{\mu}_{R_1} = 1.50 \pm 0.07$ and $\hat{\mu}_{R_2} = 1.80 \pm 0.08$
(arbitrary units). Let us assume that the measurements depend
on eight influence quantities $h_l$, and that most of them influence both
physical quantities. For simplicity we consider the $h_l$ to be independent
of each other. Tab. 3 gives the details of the correction for the
systematic effects and of the uncertainty evaluation, performed using
(209), (211) and (213). To see the importance of the correlations, the
result of the sum and of the difference of $\mu_1$ and $\mu_2$ is also reported.

In order to split the final result into "individual" and "common" uncertainty
(between parentheses in Tab. 3) we have to remember that,
if the error is additive, the covariance between $\mu_1$ and $\mu_2$ is given by
the variance of the unknown systematic error (see (183)).

The average $\overline{\mu}$ between $\mu_1$ and $\mu_2$, assuming it has a physical meaning,
can be evaluated either using the results of section 16, or simply by calculating
the average weighted with the inverse of the variances due to the
individual uncertainties, and then adding quadratically the common
uncertainty at the end. Also $\overline{\mu}$ is reported in Tab. 3.
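The rules of thumb used in the examples of this section can be collected in a short sketch. The function names are illustrative and not part of these notes. The first three functions give the type B standard uncertainty for a $k\sigma$ statement, a uniform distribution and a triangular distribution of half-width $\Delta$; the last one combines two results that share one additive common uncertainty, following the recipe of example 6 (weighted average with the individual variances, common term added quadratically at the end).

```python
import math

def u_from_k_sigma(delta, k):
    """Quoted +/- delta corresponds to k standard deviations."""
    return delta / k

def u_uniform(half_width):
    """Uniform distribution over +/- half_width."""
    return half_width / math.sqrt(3.0)

def u_triangular(half_width):
    """Symmetric triangular distribution over +/- half_width."""
    return half_width / math.sqrt(6.0)

def combine_two_results(x1, ind1, x2, ind2, common):
    """Sum, difference and average of two results with one common additive uncertainty."""
    tot1, tot2 = math.hypot(ind1, common), math.hypot(ind2, common)
    cov = common ** 2                       # covariance = variance of the common error
    w1, w2 = 1.0 / ind1 ** 2, 1.0 / ind2 ** 2
    average = (w1 * x1 + w2 * x2) / (w1 + w2)
    return {
        "rho": cov / (tot1 * tot2),
        "sum": (x1 + x2, math.sqrt(tot1 ** 2 + tot2 ** 2 + 2 * cov)),
        "difference": (x2 - x1, math.sqrt(tot1 ** 2 + tot2 ** 2 - 2 * cov)),
        "average": (average, math.hypot(1.0 / math.sqrt(w1 + w2), common)),
    }

if __name__ == "__main__":
    print(u_uniform(31.5 * 0.05), u_triangular(31.5 * 0.05))   # calorimeter example, GeV
    print(combine_two_results(1.45, 0.10, 1.90, 0.11, 0.10))   # rounded inputs of Tab. 3
```

Because the inputs taken from Tab. 3 are already rounded to two decimals, the correlation coefficient obtained this way can differ in the last digit from the tabulated one.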

14.5 Caveat concerning a blind use of approximate methods

The mathematical apparatus of variances and covariances of (206-207) is
often seen as the most complete description of uncertainty and in most cases
used blindly in further uncertainty calculations. It must be clear, however,
that this is just an approximation based on linearization. If the function
which relates the corrected value to the raw value and the systematic effects
is not linear then the linearization may cause trouble. An interesting case is
discussed in section 16.

There is another problem which may arise from the simultaneous use of 
Bayesian estimators and approximate methods. Let us introduce the problem 
with an example. 

Example 1: 1000 independent measurements of the efficiency of a detector
have been performed (or 1000 measurements of branching ratio, if you
prefer). Each measurement was carried out on a base of 100 events and
each time 10 favorable events were observed (this is obviously strange,
though not impossible, but it simplifies the calculations). The result
of each measurement will be (see (136-138)):

$$\hat{\epsilon}_i = \frac{10 + 1}{100 + 2} = 0.1078 \tag{216}$$
$$\sigma(\epsilon_i) = \sqrt{\frac{11 \cdot 91}{103 \cdot 102^2}} = 0.031 . \tag{217}$$

Combining the 1000 results using the standard weighted average procedure
gives

$$\epsilon = 0.1078 \pm 0.0010 . \tag{218}$$

Alternatively, taking the complete set of results to be equivalent to
100000 trials with 10000 favorable events, the combined result is

$$\epsilon' = 0.10001 \pm 0.0009 \tag{219}$$

(the same as if one had used Bayes' theorem iteratively to infer $f(\epsilon)$
from the 1000 partial results). The conclusions are in disagreement
and the first result is clearly mistaken.
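The disagreement in Example 1 can be reproduced with a few lines. The sketch assumes that the single results come from the binomial inference with a uniform prior, i.e. expected value $(x+1)/(n+2)$ and variance $(x+1)(n-x+1)/[(n+3)(n+2)^2]$; the function names are illustrative.

```python
import math

def efficiency_posterior(x, n):
    """Expected value and standard deviation of the efficiency, uniform prior."""
    mean = (x + 1) / (n + 2)
    var = (x + 1) * (n - x + 1) / ((n + 3) * (n + 2) ** 2)
    return mean, math.sqrt(var)

def weighted_average(results):
    """Standard weighted average of (value, sigma) pairs with weights 1/sigma^2."""
    weights = [1.0 / s ** 2 for _, s in results]
    mean = sum(w * m for w, (m, _) in zip(weights, results)) / sum(weights)
    return mean, 1.0 / math.sqrt(sum(weights))

if __name__ == "__main__":
    single = efficiency_posterior(10, 100)
    naive = weighted_average([single] * 1000)        # combine 1000 partial results
    pooled = efficiency_posterior(10000, 100000)     # treat all data as one experiment
    print(single, naive, pooled)
```

The weighted average keeps the small bias of each partial result while shrinking its uncertainty, so it ends up several standard deviations away from the pooled value.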

The same problem arises in the case of inference of the Poisson distribution
parameter $\lambda$ and, in general, whenever $f(\mu)$ is not symmetrical around $E[\mu]$.

Example 2: Imagine an experiment running continuously for one year, searching
for monopoles and identifying none. The consistency with zero can
be stated either quoting $E[\lambda] = 1$ and $\sigma_\lambda = 1$, or a 95% upper limit
$\lambda < 3$. In terms of rate (number of monopoles per day) the result
would be either $E[r] = 2.7 \cdot 10^{-3}$, $\sigma(r) = 2.7 \cdot 10^{-3}$, or an upper limit
$r < 8.2 \cdot 10^{-3}$. It is easy to show that, if we take the 365 results for each of
the running days and combine them using the standard weighted average,
we get $r = 1.00 \pm 0.05$ monopoles/day! This absurdity is not caused
by the Bayesian method, but by the standard rules for combining the
results (the weighted average formulae (122) and (123) are derived from
the normal distribution hypothesis). Using Bayesian inference would
have led to a consistent and reasonable result no matter how the 365
days of running had been subdivided for partial analysis.
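Example 2 can be checked in the same spirit. The sketch assumes a uniform prior on the rate, so that after observing $x$ events in an exposure $T$ the final density is proportional to $(rT)^x e^{-rT}$; the function names are illustrative.

```python
import math

def rate_posterior(x, exposure):
    """Expected value and standard deviation of a Poisson rate, uniform prior.

    The final density is a Gamma distribution with shape x + 1 and rate `exposure`.
    """
    return (x + 1) / exposure, math.sqrt(x + 1) / exposure

def rate_upper_limit_no_events(exposure, prob=0.95):
    """Upper limit when no event is observed: solves 1 - exp(-r * exposure) = prob."""
    return -math.log(1.0 - prob) / exposure

def weighted_average(results):
    """Standard weighted average of (value, sigma) pairs with weights 1/sigma^2."""
    weights = [1.0 / s ** 2 for _, s in results]
    mean = sum(w * m for w, (m, _) in zip(weights, results)) / sum(weights)
    return mean, 1.0 / math.sqrt(sum(weights))

if __name__ == "__main__":
    print(rate_posterior(0, 365.0), rate_upper_limit_no_events(365.0))   # whole year at once
    print(weighted_average([rate_posterior(0, 1.0)] * 365))              # 365 daily results
```

Treating the year as one exposure gives a rate of a few per mille per day, while averaging the 365 daily summaries gives a rate of order one per day with a small uncertainty, which is the absurdity discussed in the example.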

This suggests that in some cases it could be preferable to give the result in
terms of the value of $\mu$ which maximizes $f(\mu)$ ($p_m$ and $\lambda_m$ of sections 12.1 and
12.2). This way of presenting the results is similar to that suggested by the
maximum likelihood approach, with the difference that for $f(\mu)$ one should
take the final probability density function and not simply the likelihood.
Since it is practically impossible to summarize the outcome of an inference
in only two numbers (best value and uncertainty), a description of the method
used to evaluate them should be provided, except when $f(\mu)$ is approximately
normally distributed (fortunately this happens most of the time).

15 Indirect measurements 

Conceptually this is a very simple task in the Bayesian framework, whereas
the frequentistic one requires a lot of gymnastics going back and forward from
the logical level of true values to the logical level of estimators. If one accepts
that the true values are just random variables$^{19}$, then, calling $Y$ a function
of other quantities $X_i$, each having a probability density function $f(x_i)$, the
probability density function of $Y$, $f(y)$, can be calculated with the standard
formulae which follow from the rules of probability. Note that in the approach
presented in these notes uncertainties due to systematic effects are treated in
the same way as indirect measurements. It is worth repeating that there is
no conceptual distinction between various components of the measurement
uncertainty. When approximations are sufficient, formulae (206) and (207)
can be used.

Let us take an example for which the linearization does not give the right
result:

Example: The speed of a proton is measured with a time-of-flight system.
Find the 68, 95 and 99 % probability intervals for the energy, knowing
that $\beta = v/c = 0.9971$, and that distance and time have been measured
with a 0.2 % accuracy.

The relation

$$E = \frac{mc^2}{\sqrt{1 - \beta^2}}$$

is strongly non linear. The results given by the approximated method
and the correct one are, respectively:



    C.L. (%)    linearization, E (GeV)    correct, E (GeV)
    68          6.4 < E < 18              8.8 < E < 64
    95          0.7 < E < 24              7.2 < E < ∞
    99          0.  < E < 28              6.6 < E < ∞
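The time-of-flight example can be reproduced approximately with the sketch below. It assumes that $\beta$ is normally distributed with a relative standard deviation given by the quadratic combination of the distance and time accuracies, that values $\beta \ge 1$ are mapped to an unbounded energy, and that the linearized lower edge is cut at zero; the proton rest energy is the standard value and is not quoted in these notes. The function names are illustrative.

```python
import math
from statistics import NormalDist

PROTON_REST_ENERGY_GEV = 0.938272   # standard value, assumed here

def energy(beta):
    """E = m c^2 / sqrt(1 - beta^2); unbounded once beta reaches 1."""
    if beta >= 1.0:
        return math.inf
    return PROTON_REST_ENERGY_GEV / math.sqrt(1.0 - beta * beta)

def energy_intervals(level, beta0=0.9971, rel_distance=0.002, rel_time=0.002):
    """Return (linearized, exact) central probability intervals for E in GeV."""
    sigma_beta = beta0 * math.hypot(rel_distance, rel_time)   # beta = distance / (c * time)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)               # half-width in standard deviations
    e0 = energy(beta0)
    # linearization: sigma_E = |dE/dbeta| * sigma_beta, with dE/dbeta = E * beta / (1 - beta^2)
    sigma_e = e0 * beta0 / (1.0 - beta0 ** 2) * sigma_beta
    linear = (max(0.0, e0 - z * sigma_e), e0 + z * sigma_e)
    # exact: E grows monotonically with beta, so quantiles of beta map onto quantiles of E
    exact = (energy(beta0 - z * sigma_beta), energy(beta0 + z * sigma_beta))
    return linear, exact

if __name__ == "__main__":
    for level in (0.68, 0.95, 0.99):
        print(level, energy_intervals(level))
```

The upper edge of the exact 68 % interval is extremely sensitive to the rounding of the input accuracies, because the corresponding $\beta$ is very close to 1, so small differences from the tabulated value are to be expected there.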



16 Covariance matrix of experimental results 

16.1 Building the covariance matrix of experimental 
data 

In physics applications, it is rarely the case that the covariance between 
the best estimates of two physical quantities'^, each given by the arithmetic 

^^To make the formalism lighter, let us call both the random variable associated to the 
quantity and the quantity itself by the same name Xi (instead of Hxi)- 

^°In this section[16] the symbol will indicate the variable associated to the i-th 
physical quantity and Xik its fc-th direct measurement; Xi the best estimate of its value, 
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average of direct measurements {xi — Xi = ^J2k=i-^i><:), can be evaluated 
from the sample covariance of the two averages 

1 — — 

Cov{xi,xj) = _ J2^Xik - Xi){X^k - Xj) . (220) 

^ ^ k=i 

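More frequent is the well understood case in which the physical quantities
are obtained as a result of a $\chi^2$ minimization, and the terms of the inverse
of the covariance matrix are related to the curvature of $\chi^2$ at its minimum:

$$\left(V^{-1}\right)_{ij} = \frac{1}{2} \left. \frac{\partial^2 \chi^2}{\partial X_i \, \partial X_j} \right|_{x_i, x_j} . \tag{221}$$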



In most cases one determines independent values of physical quantities 
with the same detector, and the correlation between them originates from 
the detector calibration uncertainties. Frequentistically, the use of (220) in 
this case would correspond to having a "sample of detectors" , with each of 
which a measurement of all the physical quantities is performed. 

A way of building the covariance matrix from the direct measurements 
is to consider the original measurements and the calibration constants as a 
common set of independent and uncorrelated measurements, and then to cal- 
culate corrected values that take into account the calibration constants. The 
variance/covariance propagation will automatically provide the full covari- 
ance matrix of the set of results. Let us derive it for two cases that happen 
frequently, and then proceed to the general case. 



16.1.1 Offset uncertainty 

Let $x_i \pm \sigma_i$ be the $i = 1 \ldots n$ results of independent measurements and $V_X$
the (diagonal) covariance matrix. Let us assume that they are all affected
by the same calibration constant $c$, having a standard uncertainty $\sigma_c$. The
corrected results are then $y_i = x_i + c$. We can assume, for simplicity, that the
most probable value of $c$ is 0, i.e. the detector is well calibrated. One has to
consider the calibration constant as the physical quantity $X_{n+1}$, whose best
estimate is $x_{n+1} = 0$. A term $V_{X_{n+1,n+1}} = \sigma_c^2$ must be added to the covariance
matrix.

The covariance matrix of the corrected results is given by the transformation:

$$V_Y = M V_X M^T , \tag{222}$$

obtained by an average over many direct measurements or indirect measurements, $\sigma_i$ the
standard deviation, and $y_i$ the value corrected for the calibration constants. The weighted
average of several $x_i$ will be denoted by $\overline{x}$.
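where $M_{ij} = \left. \frac{\partial Y_i}{\partial X_j} \right|_{x_j}$. The elements of $V_Y$ are given by

$$(V_Y)_{kl} = \sum_{ij} \left. \frac{\partial Y_k}{\partial X_i} \right|_{x_i} \left. \frac{\partial Y_l}{\partial X_j} \right|_{x_j} (V_X)_{ij} . \tag{223}$$

In this case we get:

$$\sigma^2(Y_i) = \sigma_i^2 + \sigma_c^2 \tag{224}$$

$$\mathrm{Cov}(Y_i, Y_j) = \sigma_c^2 \qquad (i \ne j) \tag{225}$$

$$\rho_{ij} = \frac{\sigma_c^2}{\sqrt{\sigma_i^2 + \sigma_c^2} \, \sqrt{\sigma_j^2 + \sigma_c^2}} \tag{226}$$

$$\phantom{\rho_{ij}} = \frac{1}{\sqrt{1 + \left(\frac{\sigma_i}{\sigma_c}\right)^2} \, \sqrt{1 + \left(\frac{\sigma_j}{\sigma_c}\right)^2}} , \tag{227}$$

reobtaining the results of section 13.3. The total uncertainty on the single
measurement is given by the combination in quadrature of the individual and
the common standard uncertainties, and all the covariances are equal to $\sigma_c^2$.
To verify, in a simple case, that the result is reasonable, let us consider only
two independent quantities $X_1$ and $X_2$, and a calibration constant $X_3 = c$,
having an expected value equal to zero. From these we can calculate the
correlated quantities $Y_1$ and $Y_2$ and finally their sum ($S \equiv Z_1$) and difference
($D \equiv Z_2$). The results are: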



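$$V_Y = \begin{pmatrix} \sigma_1^2 + \sigma_c^2 & \sigma_c^2 \\ \sigma_c^2 & \sigma_2^2 + \sigma_c^2 \end{pmatrix} \tag{228, 229}$$

$$V_Z = \begin{pmatrix} \sigma_1^2 + \sigma_2^2 + 4 \cdot \sigma_c^2 & \sigma_1^2 - \sigma_2^2 \\ \sigma_1^2 - \sigma_2^2 & \sigma_1^2 + \sigma_2^2 \end{pmatrix} \tag{230, 231}$$

It follows that

$$\sigma^2(S) = \sigma_1^2 + \sigma_2^2 + (2 \cdot \sigma_c)^2 \tag{232}$$

$$\sigma^2(D) = \sigma_1^2 + \sigma_2^2 , \tag{233}$$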



as intuitively expected. 
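The offset case can be checked numerically with a few lines of code. The following is a minimal sketch with hypothetical values for $\sigma_1$, $\sigma_2$ and $\sigma_c$ (they do not come from the text); the helper name `propagate` is also just an illustrative choice. It applies $V_Y = M V_X M^T$ twice, first to obtain the corrected results and then their sum and difference.

```python
def propagate(M, V):
    """Return M V M^T for matrices given as lists of rows."""
    MV = [[sum(m * V[k][j] for k, m in enumerate(row)) for j in range(len(V))]
          for row in M]
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in M] for r1 in MV]


# Hypothetical standard uncertainties: two measurements and a common offset c.
s1, s2, sc = 0.3, 0.4, 0.5
V_x = [[s1**2, 0.0, 0.0],
       [0.0, s2**2, 0.0],
       [0.0, 0.0, sc**2]]

# Y_i = X_i + c, so dY_i/dX_i = 1 and dY_i/dc = 1.
M_y = [[1.0, 0.0, 1.0],
       [0.0, 1.0, 1.0]]
V_y = propagate(M_y, V_x)

# S = Y_1 + Y_2 and D = Y_1 - Y_2.
M_z = [[1.0, 1.0],
       [1.0, -1.0]]
V_z = propagate(M_z, V_y)

print("Cov(Y_1, Y_2) =", V_y[0][1])   # equals sigma_c^2
print("sigma^2(S)    =", V_z[0][0])   # sigma_1^2 + sigma_2^2 + (2 sigma_c)^2
print("sigma^2(D)    =", V_z[1][1])   # sigma_1^2 + sigma_2^2
```

The common offset drops out of the difference and enters the sum with a factor of two, as stated above.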



16.1.2 Normalization uncertainty 

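Let us consider now the case where the calibration constant is the scale factor
$f$, known with a standard uncertainty $\sigma_f$. Also in this case, for simplicity
and without losing generality, let us suppose that the most probable value of
$f$ is 1. Then $X_{n+1} = f$, i.e. $x_{n+1} = 1$, and $V_{X_{n+1,n+1}} = \sigma_f^2$.
Then

$$\sigma^2(Y_i) = \sigma_i^2 + \sigma_f^2 \cdot x_i^2 \tag{234}$$

$$\mathrm{Cov}(Y_i, Y_j) = \sigma_f^2 \cdot x_i \cdot x_j \qquad (i \ne j) \tag{235}$$

$$\rho_{ij} = \frac{\sigma_f^2 \cdot x_i \cdot x_j}{\sqrt{\sigma_i^2 + \sigma_f^2 \cdot x_i^2} \, \sqrt{\sigma_j^2 + \sigma_f^2 \cdot x_j^2}} \tag{236}$$

$$|\rho_{ij}| = \frac{1}{\sqrt{1 + \left(\frac{\sigma_i}{\sigma_f x_i}\right)^2} \, \sqrt{1 + \left(\frac{\sigma_j}{\sigma_f x_j}\right)^2}} . \tag{238}$$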



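To verify the results let us consider two independent measurements $X_1$ and
$X_2$, let us calculate the correlated quantities $Y_1$ and $Y_2$, and finally their
product ($P \equiv Z_1$) and their ratio ($R \equiv Z_2$):

$$V_Y = \begin{pmatrix} \sigma_1^2 + \sigma_f^2 \cdot x_1^2 & \sigma_f^2 \cdot x_1 \cdot x_2 \\ \sigma_f^2 \cdot x_1 \cdot x_2 & \sigma_2^2 + \sigma_f^2 \cdot x_2^2 \end{pmatrix} \tag{239, 240}$$

$$V_Z = \begin{pmatrix} \sigma_1^2 \cdot x_2^2 + \sigma_2^2 \cdot x_1^2 + 4 \cdot \sigma_f^2 \cdot x_1^2 \cdot x_2^2 & \sigma_1^2 - \sigma_2^2 \cdot \frac{x_1^2}{x_2^2} \\ \sigma_1^2 - \sigma_2^2 \cdot \frac{x_1^2}{x_2^2} & \frac{\sigma_1^2}{x_2^2} + \sigma_2^2 \cdot \frac{x_1^2}{x_2^4} \end{pmatrix} \tag{241, 242}$$

It follows that:

$$\sigma^2(P) = \sigma_1^2 \cdot x_2^2 + \sigma_2^2 \cdot x_1^2 + (2 \cdot \sigma_f \cdot x_1 \cdot x_2)^2 \tag{243}$$

$$\sigma^2(R) = \frac{\sigma_1^2}{x_2^2} + \sigma_2^2 \cdot \frac{x_1^2}{x_2^4} . \tag{244}$$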



Just as an unknown common offset error cancels in differences and is enhanced
in sums, an unknown normalization error has a similar effect on the
ratio and the product. It is also interesting to calculate the standard
uncertainty of a difference in case of a normalization error:

$$\sigma^2(D) = \sigma_1^2 + \sigma_2^2 + \sigma_f^2 \cdot (x_1 - x_2)^2 . \tag{245}$$



The contribution from an unknown normalization error vanishes if the two 
values are equal. 
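The same propagation can be repeated for a common scale factor. The sketch below uses hypothetical measured values and uncertainties (not taken from the text) and an illustrative helper `propagate`; it evaluates the product, the ratio and the difference of the two corrected results in one pass.

```python
def propagate(M, V):
    """Return M V M^T for matrices given as lists of rows."""
    MV = [[sum(m * V[k][j] for k, m in enumerate(row)) for j in range(len(V))]
          for row in M]
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in M] for r1 in MV]


# Hypothetical inputs: two measurements and a common scale factor f = 1 +- sf.
x1, x2 = 10.0, 8.0
s1, s2, sf = 0.3, 0.4, 0.05
V_x = [[s1**2, 0.0, 0.0],
       [0.0, s2**2, 0.0],
       [0.0, 0.0, sf**2]]

# Y_i = f * X_i evaluated at f = 1: dY_i/dX_i = 1 and dY_i/df = x_i.
M_y = [[1.0, 0.0, x1],
       [0.0, 1.0, x2]]
V_y = propagate(M_y, V_x)

# Rows: product P = Y_1 * Y_2, ratio R = Y_1 / Y_2, difference D = Y_1 - Y_2.
M_z = [[x2, x1],
       [1.0 / x2, -x1 / x2**2],
       [1.0, -1.0]]
V_z = propagate(M_z, V_y)

print("sigma^2(P) =", V_z[0][0])
print("sigma^2(R) =", V_z[1][1])   # no contribution from sf
print("sigma^2(D) =", V_z[2][2])   # sf enters only through (x1 - x2)^2
```

The scale uncertainty cancels in the ratio, is doubled in the product, and contributes to the difference only in proportion to $(x_1 - x_2)^2$.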
16.1.3 General case

Let us assume there are $n$ independently measured values $x_i$ and $m$ calibration
constants $c_j$ with their covariance matrix $V_c$. The latter can also be
theoretical parameters influencing the data, and moreover they may be correlated,
as usually happens if, for example, they are parameters of a calibration
fit. We can then include the $c_j$ in the vector that contains the measurements
and $V_c$ in the covariance matrix $V_X$:

$$x = \begin{pmatrix} x_1 \\ \vdots \\ x_n \\ c_1 \\ \vdots \\ c_m \end{pmatrix} , \qquad V_X = \begin{pmatrix} \sigma_1^2 & \cdots & 0 & \\ \vdots & \ddots & \vdots & \mathbf{0} \\ 0 & \cdots & \sigma_n^2 & \\ & \mathbf{0} & & V_c \end{pmatrix} . \tag{246}$$

The corrected quantities are obtained from the most general function

$$Y_i = Y_i(X_i, c) \qquad (i = 1, 2, \ldots, n) , \tag{247}$$

and the covariance matrix $V_Y$ from the covariance propagation $V_Y = M V_X M^T$.

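As a frequently encountered example, we can think of several normalization
constants, each affecting a subsample of the data - as is the case where
each of several detectors measures a set of physical quantities. Let us
consider just three quantities ($X_i$) and three uncorrelated normalization
standard uncertainties ($\sigma_{f_j}$), the first one common to $X_1$ and $X_2$, the second
to $X_2$ and $X_3$ and the third to all three. We get the following covariance
matrix:

$$V_Y = \begin{pmatrix}
\sigma_1^2 + (\sigma_{f_1}^2 + \sigma_{f_3}^2) \cdot x_1^2 & (\sigma_{f_1}^2 + \sigma_{f_3}^2) \cdot x_1 \cdot x_2 & \sigma_{f_3}^2 \cdot x_1 \cdot x_3 \\
(\sigma_{f_1}^2 + \sigma_{f_3}^2) \cdot x_1 \cdot x_2 & \sigma_2^2 + (\sigma_{f_1}^2 + \sigma_{f_2}^2 + \sigma_{f_3}^2) \cdot x_2^2 & (\sigma_{f_2}^2 + \sigma_{f_3}^2) \cdot x_2 \cdot x_3 \\
\sigma_{f_3}^2 \cdot x_1 \cdot x_3 & (\sigma_{f_2}^2 + \sigma_{f_3}^2) \cdot x_2 \cdot x_3 & \sigma_3^2 + (\sigma_{f_2}^2 + \sigma_{f_3}^2) \cdot x_3^2
\end{pmatrix}$$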
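The three-quantity example lends itself to a direct numerical check of the general recipe. The sketch below is only an illustration: the measured values and uncertainties are hypothetical, and it assumes that each quantity is simply multiplied by every normalization factor that applies to it (for instance $Y_2 = f_1 f_2 f_3 X_2$), with all factors having best estimate 1.

```python
def propagate(M, V):
    """Return M V M^T for matrices given as lists of rows."""
    MV = [[sum(m * V[k][j] for k, m in enumerate(row)) for j in range(len(V))]
          for row in M]
    return [[sum(a * b for a, b in zip(r1, r2)) for r2 in M] for r1 in MV]


# Hypothetical inputs: three measurements and three normalization factors.
x = [10.0, 8.0, 6.0]
s = [0.3, 0.4, 0.2]          # individual standard uncertainties
sf = [0.02, 0.03, 0.05]      # uncertainties of f_1, f_2, f_3 (all equal to 1)

# Extended (diagonal) covariance matrix of (x_1, x_2, x_3, f_1, f_2, f_3).
variances = [v**2 for v in s + sf]
V_x = [[v if i == j else 0.0 for j in range(6)] for i, v in enumerate(variances)]

# Assumed model: Y_1 = f_1 f_3 X_1, Y_2 = f_1 f_2 f_3 X_2, Y_3 = f_2 f_3 X_3.
# Columns: d/dX_1, d/dX_2, d/dX_3, d/df_1, d/df_2, d/df_3.
M = [[1.0, 0.0, 0.0, x[0], 0.0, x[0]],
     [0.0, 1.0, 0.0, x[1], x[1], x[1]],
     [0.0, 0.0, 1.0, 0.0, x[2], x[2]]]

V_y = propagate(M, V_x)
for row in V_y:
    print(["%.4f" % v for v in row])
```

Each off-diagonal element collects only the normalization factors shared by the two quantities involved, which is the pattern of the matrix above.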



16.2 Use and misuse of the covariance matrix to fit 
correlated data 

16.2.1 Best estimate of the true value from two correlated values. 

Once the covariance matrix is built one uses it in a $\chi^2$ fit to get the parameters
of a function. The quantity to be minimized is $\chi^2$, defined as

$$\chi^2 = \Delta^T V^{-1} \Delta , \tag{248}$$
where $\Delta$ is the vector of the differences between the theoretical and the
experimental values. Let us consider the simple case in which two results of
the same physical quantity are available, and the individual and the common
standard uncertainty are known. The best estimate of the true value of the
physical quantity is then obtained by fitting the constant $Y = k$ through the
data points. In this simple case the $\chi^2$ minimization can be performed easily.
We will consider the two cases of offset and normalization uncertainty. As
before, we assume that the detector is well calibrated, i.e. the most probable
value of the calibration constant is, respectively for the two cases, 0 and 1,
and hence $y_i = x_i$.



16.2.2 Offset uncertainty 

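Let $x_1 \pm \sigma_1$ and $x_2 \pm \sigma_2$ be the two measured values, and $\sigma_c$ the common
standard uncertainty. The $\chi^2$ is

$$\chi^2 = \frac{1}{D} \left[ (x_1 - k)^2 \cdot (\sigma_2^2 + \sigma_c^2) + (x_2 - k)^2 \cdot (\sigma_1^2 + \sigma_c^2) \right. \tag{249}$$

$$\left. \qquad - 2 \cdot (x_1 - k) \cdot (x_2 - k) \cdot \sigma_c^2 \right] , \tag{250}$$

where $D = \sigma_1^2 \cdot \sigma_2^2 + (\sigma_1^2 + \sigma_2^2) \cdot \sigma_c^2$ is the determinant of the covariance matrix.

Minimizing $\chi^2$ and using the second derivative calculated at the minimum
we obtain the best value of $k$ and its standard deviation:

$$k = \frac{x_1 \sigma_2^2 + x_2 \sigma_1^2}{\sigma_1^2 + \sigma_2^2} \quad (= \overline{x}) \tag{251}$$

$$\sigma^2(k) = \frac{\sigma_1^2 \sigma_2^2}{\sigma_1^2 + \sigma_2^2} + \sigma_c^2 . \tag{253}$$

The most probable value of the physical quantity is exactly what one obtains
from the average $\overline{x}$ weighted with the inverse of the individual variances.
Its overall uncertainty is the quadratic sum of the standard deviation of the
weighted average and the common one. The result coincides with the simple
expectation.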



16.2.3 Normalization uncertainty 

Let $x_1 \pm \sigma_1$ and $x_2 \pm \sigma_2$ be the two measured values, and $\sigma_f$ the common
standard uncertainty on the scale. The $\chi^2$ is

$$\chi^2 = \frac{1}{D} \left[ (x_1 - k)^2 \cdot (\sigma_2^2 + x_2^2 \cdot \sigma_f^2) + (x_2 - k)^2 \cdot (\sigma_1^2 + x_1^2 \cdot \sigma_f^2) \right. \tag{254}$$

$$\left. \qquad - 2 \cdot (x_1 - k) \cdot (x_2 - k) \cdot x_1 \cdot x_2 \cdot \sigma_f^2 \right] , \tag{255}$$

where $D = \sigma_1^2 \cdot \sigma_2^2 + (x_1^2 \cdot \sigma_2^2 + x_2^2 \cdot \sigma_1^2) \cdot \sigma_f^2$. We obtain in this case the following
result:

$$k = \frac{x_1 \sigma_2^2 + x_2 \sigma_1^2}{\sigma_1^2 + \sigma_2^2 + (x_1 - x_2)^2 \cdot \sigma_f^2} \tag{256}$$

$$\sigma^2(k) = \frac{\sigma_1^2 \cdot \sigma_2^2 + (x_1^2 \cdot \sigma_2^2 + x_2^2 \cdot \sigma_1^2) \cdot \sigma_f^2}{\sigma_1^2 + \sigma_2^2 + (x_1 - x_2)^2 \cdot \sigma_f^2} . \tag{258}$$

With respect to the previous case, $k$ has a new term $(x_1 - x_2)^2 \cdot \sigma_f^2$ in the
denominator. As long as this is negligible with respect to the individual
variances we still get the weighted average $\overline{x}$, otherwise a smaller value is
obtained. Calling $r$ the ratio between $k$ and $\overline{x}$, we obtain

$$r = \frac{k}{\overline{x}} = \frac{1}{1 + \frac{(x_1 - x_2)^2}{\sigma_1^2 + \sigma_2^2} \cdot \sigma_f^2} . \tag{259}$$

Written in this way, one can see that the deviation from the simple average
value depends on the compatibility of the two values and on the normalization
uncertainty. This can be understood in the following way: as soon as the
two values are in some disagreement, the fit starts to vary the normalization
factor - in a hidden way - and to squeeze the scale by an amount allowed by
$\sigma_f$, in order to minimize the $\chi^2$. The reason the fit prefers normalization
factors smaller than 1 under these conditions lies in the standard formalism of
the covariance propagation, where only first derivatives are considered. This
implies that the individual standard deviations are not rescaled by lowering
the normalization factor, but the points get closer.

Example 1. Consider the results of two measurements, $8.0 \cdot (1 \pm 2\,\%)$ and
$8.5 \cdot (1 \pm 2\,\%)$, having a 10 % common normalization error. Assuming
that the two measurements refer to the same physical quantity, the
best estimate of its true value can be obtained by fitting the points to
a constant. Minimizing $\chi^2$ with $V$ estimated empirically by the data,
as explained in the previous section, one obtains a value of $7.87 \pm 0.81$,
which is surprising to say the least, since the most probable result is
outside the interval determined by the two measured values.
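Example 1 is easy to reproduce. The sketch below is a minimal illustration under the assumptions stated in the example: the covariance matrix is built from the measured values themselves (2 % individual uncertainties, 10 % common scale uncertainty), and a constant is fitted by minimizing $\chi^2 = \Delta^T V^{-1} \Delta$. The function name `fit_constant` is an illustrative choice.

```python
from math import sqrt


def fit_constant(x, V):
    """Fit a constant k to two correlated values by minimizing Delta^T V^-1 Delta."""
    (v11, v12), (_, v22) = V
    det = v11 * v22 - v12 * v12
    # Elements of the inverse of the symmetric 2x2 covariance matrix.
    w11, w12, w22 = v22 / det, -v12 / det, v11 / det
    norm = w11 + 2.0 * w12 + w22            # sum of all elements of V^-1
    k = ((w11 + w12) * x[0] + (w12 + w22) * x[1]) / norm
    return k, sqrt(1.0 / norm)              # best value and its standard deviation


x = (8.0, 8.5)
sigma = (0.02 * x[0], 0.02 * x[1])          # 2 % individual uncertainties
sigma_f = 0.10                              # 10 % common normalization uncertainty

V = [[sigma[0]**2 + (sigma_f * x[0])**2, sigma_f**2 * x[0] * x[1]],
     [sigma_f**2 * x[0] * x[1], sigma[1]**2 + (sigma_f * x[1])**2]]

k, sigma_k = fit_constant(x, V)
print(f"k = {k:.2f} +- {sigma_k:.2f}")      # prints: k = 7.87 +- 0.81
```

The fitted value lies below both measurements, which is the effect described in the example.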

Example 2. A real life case of this strange effect, which occurred during
the global analysis of the R ratio in $e^+e^-$ performed by the CELLO
collaboration [17], is shown in Fig. 12. The data points represent the
averages in energy bins of the results of the PETRA and PEP experiments.
They are all correlated and the error bars show the total error
(see [17] for details). In particular, at the intermediate stage of the
analysis shown in the figure, an overall 1 % systematic error due to
theoretical uncertainties was included in the covariance matrix. The R
values above 36 GeV show the first hint of the rise of the $e^+e^-$ cross
section due to the $Z^0$ pole. At that time it was very interesting to
prove that the observation was not just a statistical fluctuation. In
order to test this, the R measurements were fitted with a theoretical
function having no $Z^0$ contributions, using only data below a certain
energy. It was expected that a fast increase of $\chi^2$ per number of degrees
of freedom $\nu$ would be observed above 36 GeV, indicating that a
theoretical prediction without $Z^0$ would be inadequate for describing the
high energy data. The surprising result was a "repulsion" (see Fig. 12)
between the experimental data and the fit: including the high energy
points with larger R a lower curve was obtained, while $\chi^2/\nu$ remained
almost constant.

Figure 12: R measurements from PETRA and PEP experiments with the best
fits of QED+QCD to all the data (full line) and only below 36 GeV (dashed line).
All data points are correlated (see text).

To see the source of this effect more explicitly let us consider an alternative
way often used to take the normalization uncertainty into account. A
scale factor $f$, by which all data points are multiplied, is introduced to the
expression of the $\chi^2$:

$$\chi^2_A = \frac{(f x_1 - k)^2}{(f \sigma_1)^2} + \frac{(f x_2 - k)^2}{(f \sigma_2)^2} + \frac{(f - 1)^2}{\sigma_f^2} . \tag{260}$$

Let us also consider the same expression when the individual standard
deviations are not rescaled:

$$\chi^2_B = \frac{(f x_1 - k)^2}{\sigma_1^2} + \frac{(f x_2 - k)^2}{\sigma_2^2} + \frac{(f - 1)^2}{\sigma_f^2} . \tag{261}$$

The use of $\chi^2_A$ always gives the result $k = \overline{x}$, because the term $(f - 1)^2/\sigma_f^2$ is
harmless^21 as far as the value of the minimum $\chi^2$ and the determination of
$k$ are concerned. Its only influence is on $\sigma(k)$, which turns out to be equal
to the quadratic combination of the weighted average standard deviation with
$\sigma_f \cdot \overline{x}$, the normalization uncertainty on the average. This result corresponds
to the usual one when the normalization factor in the definition of $\chi^2$ is not
included, and the overall uncertainty is added at the end.

Instead, the use of $\chi^2_B$ is equivalent to the covariance matrix: the same
values of the minimum $\chi^2$, of $k$ and of $\sigma(k)$ are obtained, and $f$ at the
minimum turns out to be exactly the $r$ ratio defined above. This demonstrates
that the effect happens when the data values are rescaled independently of
their standard uncertainties. The effect can become huge if the data show
mutual disagreement. The equality of the results obtained with $\chi^2_B$ with
those obtained with the covariance matrix allows us to study, in a simpler
way, the behaviour of $r$ ($= f$) when an arbitrary number of data points are
analysed. The fitted value of the normalization factor is

$$f = \frac{1}{1 + \sum_{i=1}^{n} \frac{(x_i - \overline{x})^2}{\sigma_i^2} \cdot \sigma_f^2} . \tag{263}$$

If the values of $x_i$ are consistent with a common true value it can be shown
that the expected value of $f$ is

$$\langle f \rangle = \frac{1}{1 + (n - 1) \cdot \sigma_f^2} . \tag{264}$$

Hence, there is a bias on the result when for a non-vanishing $\sigma_f$ a large
number of data points are fitted. In particular, the fit on average produces a
bias larger than the normalization uncertainty itself if $\sigma_f > 1/(n - 1)$. One
can also see that $\sigma^2(k)$ and the minimum of $\chi^2$ obtained with the covariance
matrix or with $\chi^2_B$ are smaller by the same factor $r$ than those obtained with
$\chi^2_A$.
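The size of the hidden rescaling can be evaluated directly. The sketch below assumes the reading of (263) as $f = 1 / (1 + \sigma_f^2 \sum_i (x_i - \overline{x})^2 / \sigma_i^2)$ and of (264) as $\langle f \rangle = 1 / (1 + (n-1) \sigma_f^2)$; both follow from minimizing the expression with unscaled standard deviations, but the function names and the numerical inputs (those of Example 1) are illustrative choices.

```python
def weighted_mean(x, sigma):
    """Average weighted with the inverse of the individual variances."""
    w = [1.0 / s**2 for s in sigma]
    return sum(wi * xi for wi, xi in zip(w, x)) / sum(w)


def fitted_scale(x, sigma, sigma_f):
    """Fitted normalization factor f, assuming the form of Eq. (263)."""
    xbar = weighted_mean(x, sigma)
    spread = sum(((xi - xbar) / si) ** 2 for xi, si in zip(x, sigma))
    return 1.0 / (1.0 + sigma_f**2 * spread)


def expected_scale(n, sigma_f):
    """Expected f for n mutually consistent points, assuming the form of Eq. (264)."""
    return 1.0 / (1.0 + (n - 1) * sigma_f**2)


# Data of Example 1: the fitted constant is k = f * xbar.
x, sigma, sigma_f = (8.0, 8.5), (0.16, 0.17), 0.10
f = fitted_scale(x, sigma, sigma_f)
print(f"f = {f:.4f}, k = {f * weighted_mean(x, sigma):.2f}")

# Average bias 1 - <f> as the number of fitted points grows.
for n in (2, 5, 11, 50):
    print(n, round(1.0 - expected_scale(n, sigma_f), 4))
```

With a 10 % normalization uncertainty the average bias approaches the size of the uncertainty itself at about a dozen points, in line with the condition $\sigma_f > 1/(n - 1)$.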
^21 This can be seen rewriting (260) as

$$\chi^2_A = \frac{(x_1 - k/f)^2}{\sigma_1^2} + \frac{(x_2 - k/f)^2}{\sigma_2^2} + \frac{(f - 1)^2}{\sigma_f^2} . \tag{262}$$

For any $f$, the first two terms determine the value of $k$, and the third one binds $f$ to 1.
16.2.4 Peelle's Pertinent Puzzle

To summarize, where there is an overall uncertainty due to an unknown
systematic error and the covariance matrix is used to define $\chi^2$, the behaviour
of the fit depends on whether the uncertainty is on the offset or on the scale.
In the first case the best estimates of the function parameters are exactly 
those obtained without overall uncertainty, and only the parameters' stan- 
dard deviations are affected. In the case of unknown normalization errors, 
biased results can be obtained. The size of the bias depends on the fitted 
function, on the magnitude of the overall uncertainty and on the number of 
data points. 

It has also been shown that this bias comes from the linearization per- 
formed in the usual covariance propagation. This means that, even though 
the use of the covariance matrix can be very useful in analyzing the data in 
a compact way using available computer algorithms, care is required if there 
is one large normalization uncertainty which affects all the data. 

The effect discussed above has also been observed independently by R.W. 
Peelle and reported the year after the analysis of the CELLO data[17]. The 
problem has been extensively discussed among the community of nuclear 
physicists, where it is presently known as "Peelle's Pertinent Puzzle" [18]. 

A recent case in High Energy Physics in which this effect has been found 
to have biased the result is discussed in [19]. 

17 Multi-effect multi-cause inference: unfold- 
ing 

17.1 Problem and typical solutions 

In any experiment the distribution of the measured observables differs from 
that of the corresponding true physical quantities due to physics and detector 
effects. For example, one may be interested in measuring the variables $x$
and $Q^2$ in Deep Inelastic Scattering events. In such a case one is able to build
statistical estimators which in principle have a physical meaning similar to 
the true quantities, but which have a non-vanishing variance and are also 
distorted due to QED and QCD radiative corrections, parton fragmentation, 
particle decay and limited detector performances. The aim of the experimen- 
talist is to unfold the observed distribution from all these distortions so as 
to extract the true distribution (see also [20] and [21]). This requires a satis- 
factory knowledge of the overall effect of the distortions on the true physical 
quantity. 
When dealing with only one physical variable the usual method for handling
this problem is the so called bin-to-bin correction: one evaluates a
generalized efficiency (it may even be larger than unity) calculating the ratio 
between the number of events falling in a certain bin of the reconstructed 
variable and the number of events in the same bin of the true variable with a 
Monte Carlo simulation. This efficiency is then used to estimate the number 
of true events from the number of events observed in that bin. Clearly this 
method requires the same subdivision in bins of the true and the experimen- 
tal variable and hence it cannot take into account large migrations of events 
from one bin to the others. Moreover it neglects the unavoidable correlations 
between adjacent bins. This approximation is valid only if the amount of mi- 
gration is negligible and if the standard deviation of the smearing is smaller 
than the bin size. 

An attempt to solve the problem of migrations is sometimes made by building
a matrix which connects the number of events generated in one bin to
the number of events observed in the other bins. This matrix is then in-
verted and applied to the measured distribution. This immediately produces 
inversion problems if the matrix is singular. On the other hand, there is 
no reason from a probabilistic point of view why the inverse matrix should 
exist. This can easily be seen by taking the example of two bins of the true
quantity both of which have the same probability of being observed in each 
of the bins of the measured quantity. It follows that treating probability dis- 
tributions as vectors in space is not correct, even in principle. Moreover the 
method is not able to handle large statistical fluctuations even if the matrix 
can be inverted (if we have, for example, a very large number of events with 
which to estimate its elements and we choose the binning in such a way as 
to make the matrix not singular). The easiest way to see this is to think of 
the unavoidable negative terms of the inverse of the matrix which in some 
extreme cases may yield negative numbers of unfolded events. Quite apart 
from these theoretical reservations, the actual experience of those who have 
used this method is rather discouraging, the results being highly unstable. 
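Both difficulties of the matrix inversion approach can be seen already with two bins. The sketch below uses hypothetical smearing probabilities and event counts (nothing here comes from the text): a statistical fluctuation in the observed counts is enough to produce a negative unfolded content, and a matrix whose two true bins populate the measured bins with identical probabilities cannot be inverted at all.

```python
def invert_2x2(m):
    """Invert a 2x2 matrix, raising ZeroDivisionError if it is singular."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ZeroDivisionError("smearing matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def unfold_by_inversion(smearing, observed):
    """Apply the inverse smearing matrix to the observed bin contents."""
    inv = invert_2x2(smearing)
    return [sum(inv[i][j] * observed[j] for j in range(2)) for i in range(2)]


# smearing[j][i] = P(observed bin j | true bin i); hypothetical values.
smearing = [[0.6, 0.4],
            [0.4, 0.6]]

# Observed counts with a large fluctuation: the second unfolded bin is negative.
print(unfold_by_inversion(smearing, [10, 2]))

# Two true bins with identical smearing: no inverse exists.
try:
    unfold_by_inversion([[0.5, 0.5], [0.5, 0.5]], [10, 2])
except ZeroDivisionError as err:
    print("cannot unfold:", err)
```

The inverse matrix necessarily contains negative elements, so whenever the observed counts fluctuate away from what the smearing allows, the "unfolded" numbers leave the physical region.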

17.2 Bayes' theorem stated in terms of causes and ef- 
fects 

Let us state Bayes' theorem in terms of several independent causes ($C_i$, $i = 1, 2, \ldots, n_C$)
which can produce one effect ($E$). For example, if we consider
Deep Inelastic Scattering events, the effect $E$ can be the observation of an
event in a cell of the measured quantities $\{\Delta Q^2_{meas}, \Delta x_{meas}\}$. The causes
$C_i$ are then all the possible cells of the true values $\{\Delta Q^2_{true}, \Delta x_{true}\}_i$. Let us
assume we know the initial probability of the causes $P(C_i)$ and the conditional
probability that the i-th cause will produce the effect $P(E|C_i)$. The Bayes
formula is then

$$P(C_i|E) = \frac{P(E|C_i) \cdot P(C_i)}{\sum_{l=1}^{n_C} P(E|C_l) \cdot P(C_l)} . \tag{265}$$

$P(C_i|E)$ depends on the initial probability of the causes. If one has no better
prejudice concerning $P(C_i)$ the process of inference can be started from a
uniform distribution.

The final distribution depends also on $P(E|C_i)$. These probabilities must
be calculated or estimated with Monte Carlo methods. One has to keep in
mind that, in contrast to $P(C_i)$, these probabilities are not updated by the
observations. So if there are ambiguities concerning the choice of $P(E|C_i)$
one has to try them all in order to evaluate their systematic effects on the
results.



17.3 Unfolding an experimental distribution 

If one observes $n(E)$ events with effect $E$, the expected number of events
assignable to each of the causes is

$$\hat{n}(C_i) = n(E) \cdot P(C_i|E) . \tag{266}$$

As the outcome of a measurement one has several possible effects $E_j$ ($j = 1, 2, \ldots, n_E$)
for a given cause $C_i$. For each of them the Bayes formula (265)
holds, and $P(C_i|E_j)$ can be evaluated. Let us write (265) again in the case
of $n_E$ possible effects^22, indicating the initial probability of the causes with
$P_0(C_i)$:

$$P(C_i|E_j) = \frac{P(E_j|C_i) \cdot P_0(C_i)}{\sum_{l=1}^{n_C} P(E_j|C_l) \cdot P_0(C_l)} . \tag{267}$$

One should note that:

• $\sum_{i=1}^{n_C} P_0(C_i) = 1$, as usual. Notice that if the probability of a cause is
initially set to zero it can never change, i.e. if a cause does not exist it
cannot be invented;

• $\sum_{i=1}^{n_C} P(C_i|E_j) = 1$: this normalization condition, mathematically
trivial since it comes directly from (267), indicates that each effect
must come from one or more of the causes under examination. This
means that if the observables also contain a non negligible amount of
background, this needs to be included among the causes;

• $0 \le \epsilon_i \equiv \sum_{j=1}^{n_E} P(E_j|C_i) \le 1$: there is no need for each cause to
produce at least one of the effects. $\epsilon_i$ gives the efficiency of finding the
cause $C_i$ in any of the possible effects.

^22 The broadening of the distribution due to the smearing suggests a choice of $n_E$ larger
than $n_C$. It is worth remarking that there is no need to reject events where a measured
quantity has a value outside the range allowed for the physical quantity. For example,
in the case of Deep Inelastic Scattering events, cells with $x_{meas} > 1$ or $Q^2_{meas} < 0$ give
information about the true distribution too.

After $N_{obs}$ experimental observations one obtains a distribution of
frequencies $n(E) \equiv \{n(E_1), n(E_2), \ldots, n(E_{n_E})\}$. The expected number of events
to be assigned to each of the causes (taking into account only the observed
events) can be calculated applying (266) to each effect:

$$\left. \hat{n}(C_i) \right|_{obs} = \sum_{j=1}^{n_E} n(E_j) \cdot P(C_i|E_j) . \tag{268}$$

When inefficiency^23 is also brought into the picture, the best estimate of the
true number of events becomes

$$\hat{n}(C_i) = \frac{1}{\epsilon_i} \sum_{j=1}^{n_E} n(E_j) \cdot P(C_i|E_j) \qquad \epsilon_i \ne 0 . \tag{269}$$

From these unfolded events we can estimate the true total number of events,
the final probabilities of the causes and the overall efficiency:

$$\hat{N}_{true} = \sum_{i=1}^{n_C} \hat{n}(C_i)$$

$$\hat{P}(C_i) \equiv P(C_i|n(E)) = \frac{\hat{n}(C_i)}{\hat{N}_{true}}$$

$$\hat{\epsilon} = \frac{N_{obs}}{\hat{N}_{true}} .$$

If the initial distribution $P_0(C)$ is not consistent with the data, it will not
agree with the final distribution $\hat{P}(C)$. The closer the initial distribution is
to the true distribution, the better the agreement is. For simulated data one
can easily verify that the distribution $\hat{P}(C)$ lies between
$P_0(C)$ and the true one. This suggests proceeding iteratively. Fig. 13 shows
an example of a bidimensional distribution unfolding.
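The unfolding step and its iteration can be written compactly. The following is a minimal sketch, not the procedure of [23]: it applies (267) and (269) with a uniform starting distribution and simply feeds the resulting probabilities back as the next prior, without the smoothing or uncertainty evaluation a real analysis would need. The function names, the smearing matrix and the event counts are hypothetical, and at least one observed event is assumed.

```python
def unfold_step(n_obs, smearing, prior):
    """One unfolding step.

    n_obs[j]       : number of observed events with effect E_j
    smearing[j][i] : P(E_j | C_i), e.g. estimated by Monte Carlo
    prior[i]       : initial probability P_0(C_i)
    Returns the estimated true counts and the updated cause probabilities.
    """
    n_c, n_e = len(prior), len(n_obs)
    # Efficiency of each cause: epsilon_i = sum_j P(E_j | C_i).
    eff = [sum(smearing[j][i] for j in range(n_e)) for i in range(n_c)]
    n_true = [0.0] * n_c
    for j in range(n_e):
        norm = sum(smearing[j][l] * prior[l] for l in range(n_c))
        if norm == 0.0:
            continue  # effect E_j cannot be produced by any allowed cause
        for i in range(n_c):
            if eff[i] > 0.0:  # causes with zero efficiency stay at zero
                posterior = smearing[j][i] * prior[i] / norm   # Eq. (267)
                n_true[i] += n_obs[j] * posterior / eff[i]     # Eq. (269)
    total = sum(n_true)
    return n_true, [n / total for n in n_true]


def unfold(n_obs, smearing, n_iter=4):
    """Iterate unfold_step (n_iter >= 1), starting from a uniform prior."""
    n_c = len(smearing[0])
    prior = [1.0 / n_c] * n_c
    for _ in range(n_iter):
        n_true, prior = unfold_step(n_obs, smearing, prior)
    return n_true


# Hypothetical test: a true distribution (80, 20) smeared into (68, 32).
smearing = [[0.8, 0.2],
            [0.2, 0.8]]
observed = [68.0, 32.0]
for steps in (1, 4, 50):
    print(steps, [round(n, 1) for n in unfold(observed, smearing, steps)])
```

In this toy case the estimate moves from the uniform prior towards the true contents as the number of steps grows, and the unfolded counts never become negative.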

More details about iteration strategy, evaluation of uncertainty, etc. can
be found in [23]. I would just like to comment on an obvious criticism that
may be made: "The iterative procedure is against the Bayesian spirit, since
the same data are used many times for the same inference". In principle the
objection is valid, but in practice this technique is a "trick" to give to the
experimental data a weight (an importance) larger than that of the priors. A
more rigorous procedure which took into account uncertainties and correlations
of the initial distribution would have been much more complicated. An
attempt of this kind can be found in [22]. Examples of unfolding procedures
performed with non-Bayesian methods are described in [20] and [21].

^23 If $\epsilon_i = 0$ then $\hat{n}(C_i)$ will be set to zero, since the experiment is not sensitive to the
cause $C_i$.

Figure 13: Example of a two dimensional unfolding: true distribution (a), smeared
distribution (b) and results after the first 4 steps ((c) to (f)).
18 Conclusions



These notes have shown how it is possible to build a powerful theory of mea- 
surement uncertainty starting from a definition of probability which seems of 
little use at first sight, and from a formula that - some say - looks too trivial 
to be called a theorem. 

The main advantages the Bayesian approach has over the others are (fur- 
ther to the non negligible fact that it is able to treat problems on which the 
others fail): 

• the recovery of the intuitive idea of probability as a valid concept for 
treating scientific problems; 

• the simplicity and naturalness of the basic tool; 

• the capability of combining prior prejudices and experimental facts; 

• the automatic updating property as soon as new facts are observed; 

• the transparency of the method which allows the different assumptions 
on which the inference may depend to be checked and changed; 

• the high degree of awareness that it gives to its user. 

When employed on the problem of measurement errors, as a special ap- 
plication of conditional probabilities, it allows all possible sources of uncer- 
tainties to be treated in the most general way. 

When the problems get complicated and the general method becomes 
too heavy to handle, it is possible to use approximate methods based on the 
linearization of the final probability density function to calculate the first 
and second moments of the distribution. Although the formulae are exactly 
those of the standard "error propagation", the interpretation of the true 
value as a random variable simplifies the picture and allows easy inclusion of 
uncertainties due to systematic effects. 

Nevertheless there are some cases in which the linearization may cause 
severe problems, as shown in Section 14. In such cases one needs to go back 
to the general method or to apply other kinds of approximations which are 
not just blind use of the covariance matrix. 

The problem of unfolding dealt with in the last section should be considered
a side remark with respect to the mainstream of the notes. It is in
fact a mixture of genuine Bayesian method (the basic formula), approxima- 
tions (covariance matrix evaluation) and ad hoc prescriptions of iteration and 
smoothing, used to sidestep the formidable problem of seeking the general 
solution. 
I would like to conclude with three quotations:



• "(...) evaluation of the uncertainty (...) must be given
(...) 

The method stands, therefore, in contrast to certain older meth- 
ods that have the following two ideas in common: 

— The first idea is that the uncertainty reported should be 'safe' 
or 'conservative' (. . . ) In fact, because the evaluation of the 
uncertainty of a measurement result is problematic, it was 
often made deliberately large. 

— The second idea is that the influences that give rise to un- 
certainty were always recognizable as either 'random' or 'sys- 
tematic' with the two being of different nature; (...)" 

(ISO Guide[2])

• "Well, QED is very nice and impressive, but when everything is 
so neatly wrapped up in blue bows, with all experiments in exact 
agreement with each other and with the theory - that is when one 
is learning absolutely nothing. " 

"On the other hand, when experiments are in hopeless conflict - 

or when the observations do not make sense according to conven- 
tional ideas, or when none of the new models seems to work, in 
short when the situation is an unholy mess - that is when one is 
really making hidden progress and a breakthrough is just around 
the corner!" 

(R. Feynman, 1973 Hawaii Summer Institute, cited by D. 
Perkins at 1995 EPS Conference, Brussels) 

• "Although this Guide provides a framework for assessing uncer- 
tainty, it cannot substitute for critical thinking, intellectual hon- 
esty, and professional skill. The evaluation of uncertainty is nei- 
ther a routine task nor a purely mathematical one; it depends 
on detailed knowledge of the nature of the measurand and of the 
measurement. The quality and utility of the uncertainty quoted 
for the result of a measurement therefore ultimately depend on 
the understanding, critical analysis, and integrity of those who 
contribute to the assignment of its value". (ISO Guide[2]) 
Bibliographic note

The state of the art of Bayesian theory is summarized in [7], where many 
references can be found. A concise presentation of the basic principles can 
instead be found in [24]. Text books that I have consulted are [4] and
[6]. They contain many references too. As an introduction to subjective 
probability de Finetti's "Theory of probability "[5] is a must. I have found 
the reading of [25] particularly stimulating and that of [26] very convincing
(thanks especially to the many examples and exercises). Unfortunately these
two books are only available in Italian for the moment. 

Sections 6 and 7 can be reviewed in standard text books. I recommend 
those familiar to you. 

The applied part of these notes, i.e. after section 9, is, in a sense, "orig- 
inal", as it has been derived autonomously and, in many cases, without the 
knowledge that the results have been known to experts for two centuries. 
Some of the examples of section 13 were worked out for these lectures. The 
references in the applied part are given at the appropriate place in the text 
- only those actually used have been indicated. Of particular interest is the 
Weise and Woger theory of uncertainty[14], which differs from that of these 
notes because of the additional use of the Maximum Entropy Principle. 

A consultation of the ISO Guide[2] is advised. Presently the BIPM
recommendations are also followed by the American National Institute of Standards
and Technology (NIST), whose Guidelines[27] have the advantage, with
respect to the ISO Guide, of being available on the WWW too.